PREFACE


This book has been written primarily for prospective teachers
who want to know how mental tests can be of help in their school
work. It can serve also as a guide for teachers in service. The first
seven chapters describe the varieties of mental tests and point out
the usefulness and limitations of each sort. The last three chapters
deal with the writing of objective items, with the construction
of classroom tests, and with some of the ways in which mental
tests can usefully be employed in guidance and counselling.


Chapter 2 covers in summary fashion the statistical terms and
procedures most often used with mental tests. I do not believe it
possible to describe mental tests intelligently without using
relevant statistical terms. At the same time, I think that the
classroom teacher need not be a psychometrician or testing
specialist in order to use standard tests in the school. For those
who want to go further into test construction, there is an Appendix
which treats statistical method more fully.

I have found it generally better to teach Chapter 2 before 
taking up a discussion of mental tests themselves—to use it, that 
is, as a preliminary to later chapters. Chapter 2 can then be re-
ferred to specifically when the various statistical terms occur.
This procedure has the advantage of reviewing the basic statis- 
tics when the need arises. 

I believe that the book will be found to contain ample material 
for one term’s work. This is especially true when the laboratory
exercises and questions at the ends of the chapters are covered in 
class discussion, and when reports upon relevant literature are 
required. 


Henry E. Garrett
CHAPTER 1


MENTAL TESTS IN THE SCHOOLS 




The Teacher and Mental Tests 


The widespread use of standard tests in today's schools ren- 
ders it increasingly necessary for the classroom teacher to be 
familiar with these devices, with what they are and what they do. 
Teachers are often required to administer and score tests and 
frequently to use these scores in the evaluation of pupil capa- 
bilities and future promise. This is essential, of course, if the 
standard test is to have value in the work of the school. Most 
teachers, however, have no desire to become testing specialists 
or psychometricians, and many have little knowledge of modern 
statistical method. For these reasons, books dealing chiefly with 
the statistics of test construction and with other technical prob-
lems, while a necessary part of the training of school and clinical
psychologists, are often not very useful to the teacher. In fact,
they may leave him more confused than enlightened.

This book is planned to present a comprehensive account of
standard tests for teachers and for others not planning to become
specialists in this field. It is not a book on statistical method, it
does not deal broadly with the history of testing, nor with the
applications of tests to problems of business and industry. Instead,
it describes the various sorts of test, their uses and abuses, and how
they supplement and aid the work of the classroom. Statistical
terms necessary to an understanding of the tests themselves are
defined and illustrated, but detailed calculations are not included 
in the text. The book's usefulness will be enhanced if the exer- 
cises and topics at the ends of the chapters are carefully worked 
through. It is highly desirable, too, that the instructor have the 
class examine, take and score a number of tests. The discussion
in a chapter will be clarified when there is actual familiarity with 
the tests described. 


What Mental Tests Are 


In a mental test, the examinee is confronted with a variety of 
tasks—questions to be answered, problems to be solved, direc-
tions to be followed. Answers may be given orally, in writing,
and sometimes by marking or manual manipulation, as, for 
example, by fitting blocks into apertures. Mental tests differ from 
physical tests, though there is considerable overlap in the two 
sorts of measurement. Both varieties of test require previous 
learning, and both present problems, but the mental test—to a 
greater degree than the physical—demands verbal abstraction 
rather than action, ideas rather than muscles. Tests of physical 
fitness—of height, weight, and physical strength, for example— 
differ most markedly from mental tests; in other words, are most 
physical. Tests which require speed and accuracy of hand-eye 
or hand-ear co-ordination, which demand manual dexterity and 
skill (called sensory-motor tests) are both mental and physical. 
But none of these tests is as "mental" as is the intelligence test
or school examination in algebra or history, since none of them
depends to so large a degree upon verbal symbols.

The term mental test is sometimes restricted to the measure- 
ment of intelligence or aptitude, examinations in school subjects 
being classified as educational achievement tests. The reasoning 
here is that the mental test—the intelligence test, for example—
tells us how much a child can learn, whereas the school examina-
tion tells us what he has already learned. To some extent this is
true. But the distinction between the two sorts of measurement
is one of degree rather than of kind. No mental test measures
potential ability except by way of performance. We possess no
microscope by which we can discover the inherited qualities of
a child's brain or nervous system. The general intelligence test,
to a greater degree than the school examination, measures poten-
tial ability because it draws more upon native alertness than upon
routine school learning. But the school examination also draws 
upon native alertness as expressed in school learning, and both 
sorts of test demand the use of symbols—words, diagrams, 
numbers, pictures. Accordingly, in this book the term mental test
will be used to describe both sorts of examination. 

The primary objective of a mental test is to detect individual 
differences—that is, to discover how one child compares or 
“stacks up” against another child of the same age, sex or grade 
classification. This knowledge, as we shall see later, is useful in 
many ways in school and out. A second objective of the mental
test is to discover intra-individual differences or the variations
in performance within an individual. The scores made by an
examinee, when put in comparable units and represented on a
profile, provide a useful record of the examinee’s strengths and
weaknesses.


A Classification of Mental Tests 


In beginning the study of mental tests, it will be helpful to
draw up a list of the different varieties of tests. Most widely used
tests are standardized for procedure and results. A standardized
educational achievement test, for example, is one that has been 
constructed in accordance with the best principles of test making 
and has been administered to hundreds of pupils in those grades 
for which the test is suitable. Results from standard tests are 
expressed as norms. These are typical scores earned by large 
groups of children believed to be representative of various ages
and grades. For example, a score of 45 on a standard reading test
may be the norm for children 9 years, 6 months old; or for chil-
dren who are just beginning the fourth grade.


The following outline gives some notion of the field to be
covered and at the same time furnishes an overview of the
chapters to follow.


VARIETIES OF MENTAL TEST

I. Intelligence Tests
    (1) individual: administered to one examinee at a time
    (2) group: administered, like a school examination, to many
        examinees at the same time
    (3) performance: make little or no use of language, in contrast
        with the paper-and-pencil tests in (1) and (2)

II. Educational Achievement Tests
    (1) survey: comprehensive examinations used to determine
        general academic standing
    (2) subject: examinations in specific fields—for example,
        physics, Spanish
    (3) diagnostic: cover a wide range of academic skills (in
        reading or arithmetic, for example) and are designed to
        reveal specific weaknesses and strengths

III. Aptitude Tests
    (1) general: for example, of
        (a) mechanical ability
        (b) clerical ability
    (2) special: aptitude for school subjects—for example,
        chemistry or foreign languages; differential aptitudes
    (3) professional: for example, in
        (a) law
        (b) medicine
        (c) engineering
        (d) teaching
    (4) talent: aptitude in such fields as
        (a) art
        (b) music

IV. Tests of Various Aspects of Personality
    (1) personal adjustment questionnaires: surveys of worries,
        fears, social inadequacies
    (2) attitude surveys: upon, for example, social, economic
        and political questions
    (3) inventories of interests as related to various occupations
    (4) environmental factors related to personality: question-
        naires covering socio-economic background and other
        variables
    (5) projective techniques: subtle and indirect measures of
        dominant personality trends


All these mental tests will be treated in subsequent chapters. 
The following sections of this chapter provide a brief outline of 
the development of psychological testing in order to clear the 
ground for later work. For a more complete discussion of the 
historical development of mental tests, the student should consult
references at the end of this chapter. 


The Beginnings of Mental Tests 


Interest in psychological testing developed in Germany and 
France about the middle of the last century. This interest grew 
out of the acute need for a better understanding of feebleminded- 
ness and the various forms of insanity. Tests were devised for 
the purpose of determining what the feeble-minded person can
learn, how much he can learn, and in what respects he differs
most drastically from the normal. In the case of the insane and
the mentally deteriorated, brief tests were drawn up for assessing
loss of memory, distortions of perception, distractibility, mental
fatigue, and changes in such sensory-motor functions as speed
and accuracy of motor responses. 

In England, interest in mental testing arose from the study of 
individual differences in mental and physical functions. The
leader in this movement was Sir Francis Galton, an eminent
geneticist, who set up a testing laboratory in London in 1882.
Here, for a small fee, a person could have the keenness of his
vision and hearing tested, as well as his muscular strength and
his speed and co-ordination of response. Galton's tests were quite
brief and sampled rather narrow aspects of behavior. In fact,
they were sensory-motor rather than strictly mental in char-
acter. One of the first American psychologists to become inter-
ested in mental testing was James McKeen Cattell. Cattell in-
troduced mental tests of the Galton type in this country at the
turn of the century.


Intelligence Tests: Individual 


The individual intelligence test as we know it today grew out
of the work of Alfred Binet, a French psychologist, who was
director of the laboratory for physiological psychology at the
Sorbonne. In 1904 Binet was asked to devise a mental test
suitable for use in detecting slow learners in the schools of Paris.
The test was to be used not only to sift out the subnormal
children in the grades but also to provide a better understanding
of degrees of feeblemindedness, with a view to improving the
education of these children. In 1905, Binet, with a collaborator,
Theophile Simon, brought out the first scale for measuring
intelligence. This scale consisted of thirty problems and questions
arranged in order from easy to hard. A second edition of Binet’s
Scale appeared in 1908, and a third and final edition in 1911.
These tests differed sharply from those of Galton. Binet was
interested in determining the intellectual level of school children,
not (as was Galton) in studying differences among individuals
in fairly narrow mental and sensory-motor functions. In order
to measure intelligence, Binet believed he must get tests which
would measure a child’s memory, his comprehension and
judgment, and his insight. He avoided questions which demanded
specific and routine school learning. For example, instead of
asking the examinee the product of 6 x 3 or the name of the
largest city in France, Binet asked the child to repeat four digits
(single numbers) or the words of a sentence (heard only once);
to tell the "thing to do" in specific problem situations; to
criticize (“see through”) an absurd statement or fallacy; to give
differences between, for instance, a president and a king; to
define abstract words like justice and loyalty.

Binet's famous tests became the basis for the widely used
Stanford Revision of the Binet-Simon Scale, described in Chapter
3. To Binet belongs the credit for having set up the first "age
scale"—that is, a test series in which items are arranged or
grouped by age levels. A child’s “score” on an age scale is deter-
mined by the level attained and is expressed by a mental age
(MA), which denotes the child's maturity.

Children of preschool age are unable to do tests which require
reading and word knowledge. For these children, therefore, as
well as for children handicapped in speech, vision, or hearing,
and for the non-English speaking, performance tests must be 
used. In a typical performance test, the child is asked to identify 
common objects, string beads, build towers of blocks; or he may 
be asked to fit blocks into cutouts, arrange pictures in sequence, 
match the colors of cubes. Performance tests have been devised 
for use with illiterate and less intelligent adults as well as with 
children. 


Intelligence Tests: Group 


When intelligence tests are administered to large groups of 
examinees at the same time, they are appropriately called “group 
tests.” The first group tests were developed (in 1917) during 
World War I. Together with other information, these tests 
were used (1) in accepting or rejecting men, (2) in the classifica-
tion of those accepted, (3) in the assignment of draftees to
various types of service, and (4) in determining admission of
candidates to officer training schools. There were two kinds of
group test, called Army Alpha and Army Beta. The first was
intended for soldiers who could read and write; it required that
an examinee follow fairly involved directions, solve “mental
arithmetic” problems, know the meanings of words, and perceive
relations (for example, in an analogies test the question might
be as follows: Hand is to foot as glove is to ____). Army Beta
was a non-language or non-verbal test. It made use only of
diagrams, pictures, and numbers and was answered by a simple
system of marking. Army Beta was administered to the illiterate
and the foreign-born. Directions were given in pantomime for
the benefit of those soldiers who did not understand English.

During World War II, a group intelligence test called the
Army General Classification Test (AGCT) was administered to
some 12,000,000 men. AGCT is a verbal or language test. It
includes three sorts of materials: verbal (vocabulary), numerical
(arithmetic problems), and spatial (for example, problems in
spatial relations presented by pictures of block piles to be
"counted" by the examinee). No specific “school” questions
were asked since the test was designed to measure mental alert-
ness in dealing with symbolic materials apart from specific train-
ing. Both Alpha and AGCT are still used in the testing of adults.

Between World Wars I and II, scores of group tests of intel-
ligence were constructed and used widely in the schools and col-
leges. These and other mental tests (aptitude, personality) have
been widely employed in business and industry as an aid in the
selection and placement of personnel.

In most group intelligence examinations, items are answered 
by marking one of several possible solutions (multiple-choice), 
by selecting one of two answers (true-false), and by checking or
underlining the appropriate reply among several options. These
answer techniques are called "objective" (p. 185), because in
scoring such tests the judgment of the examiner does not enter
in—or does so to a very slight degree. Group tests of intelligence
are treated in Chapter 4.


Educational Achievement Tests


Since World War I, a number of tests of educational achieve-
ment have been constructed on objective principles. These tests
are used to determine general educational level or standing, as
well as knowledge of a given subject field—as, for example,
geometry or French. The general survey test, when used in the
elementary school, is a comprehensive examination of the stu-
dent’s knowledge of reading, spelling, arithmetic, grammar and
literature, history and elementary science. Tests in separate
subjects—history or physics, for example—are also available at
educational levels from the secondary school to college.
Educational achievement tests are called diagnostic when they are
used to reveal a student’s weaknesses in a particular area such as
arithmetic or reading. Diagnostic tests must of necessity cover a
wide range of information and skills in a given subject.
Educational achievement tests are described in Chapter 5.


Aptitude Tests 


Tests designed to discover whether a student is “gifted” in
music or mathematics, say, or whether a young man has the
knack for dealing with tools and mechanical contrivances are
called aptitude tests. Aptitude may be inferred (1) from the
degree of mastery attained in a “new” subject after a period of
study. Aptitude for a foreign language, for instance, is demon-
strated in the ease with which the subject (Spanish, for example)
is acquired after a term's work. Achievement tests, given after a
period of "exposure," reveal this aptitude directly. Aptitude is
also inferred before a period of study by testing (2) to see
whether an examinee possesses those abilities and skills judged
to make for success in a given subject (for example, physics),
or in a profession (for example, medicine or law). Aptitude for
physics is gauged by finding how well the student has learned
the mathematics necessary for work in physics; aptitude for law
is judged by the student's ability to read difficult prose,
comprehend fairly involved legal arguments and follow a line of
reasoning to a conclusion. What are called "differential aptitude
tests" are designed to assess a student's strengths and weaknesses
in certain fundamental abilities believed to be crucial in a
number of activities—in and out of school.

Tests of general mechanical aptitude sample performance in a
number of activities believed to demonstrate mechanical knowl-
edge and skill. Factors measured by these tests include familiarity
with tools, insight into mechanical relations (pulleys, levers, and
the like), ability to solve problems expressed in diagrams of
machines and mechanical contrivances, and interest in mechan-
ical things, as shown by the reading of popular science, building
radios, tinkering with cars and so on. Manipulative tasks and
mechanical gadgets have been employed to test for special
abilities in a variety of situations. Among the traits studied are
manual dexterity, sensory-motor skills, visual and auditory
acuity, all of which are needed in many jobs in industry and in
the armed forces.

Clerical aptitude tests cover the knowledge and skills needed
in a business office. Tests under this head provide scores from
which we can predict an examinee’s ability to carry out the
written work of an office—to spell, check records, read and 
write easily and accurately. 

Aptitude tests of a special sort have been devised for inferring
talent in art and music. In music, for instance, many of the 
factors needed for success can be measured: "ear" for music, 
rapid and accurate reading of music at sight, and knowledge of 
harmony and other technical phases of music. In art, "taste" for 
color, form, symmetry and other artistic dimensions are deter- 
mined by comparing a student's judgments with those of 
acknowledged experts. Knowing whether a person possesses 
talent in art or music is often highly important in educational
and vocational guidance. 

Aptitude tests are treated in Chapter 6. 


Personality Tests 


Psychologists have used the questionnaire or inventory to de-
termine personality factors in three areas: (a) personal adjust-
ment, (b) attitudes, and (c) interests. In addition, questionnaires
have been used in the social sciences to survey socio-economic,
home, and community phenomena. “Tests” of personality are
in reality standard interviews designed to reveal characteristic
ways of behaving. The personal adjustment questionnaire or
personal data sheet inquires into a person's fears, worries,
anxieties, and home and work adjustments. Such inventories are 
often appropriately called “trouble sheets.” In some cases, the 
questions are direct and undisguised: "Are you afraid of high 
places?” “Can you stand the sight of blood?” “Do your parents 
treat you right?” In other adjustment inventories, questions are 
disguised and indirect, so that the intent of the question may not 
be understood by the examinee. A technique often used in such 
inventories is that of “forced choices” (p. 168). 

Attitude questionnaires attempt to reveal systematic ways of 
behaving or thinking about social, religious, or political matters.
Can a student be classified as narrow- or broad-minded, religious
or irreligious, or somewhere between these extremes? Attitude
inventories try to answer these questions. 

Interest inventories survey a person's interests in books, sports,
people, occupations, social activities, and the like. An examinee's
pattern of interests may serve to identify him with some well-
defined occupational group—for example, lawyers or chemists.
Or a young man’s interests may identify him with some area of
interests, such as science, business, or social service. Interest tests
are especially valuable in counseling, since interest, as much as
ability, may determine a student's educational or vocational 
choices. 

Another group of personality tests makes use of what has been 
called “projective” techniques. Projective tests are disguised 
interviews in which an examinee is asked what he "sees" in some 
neutral situation—an ink blot or a picture, for example. These 
tests are perhaps most useful in the diagnosis of disturbed mental 
states. They must be administered by an expert and are employed 
mostly by psychiatrists and clinical psychologists in severe be- 
havior problems. 

The techniques of the personality questionnaire have been
widely used in polls conducted to assess public opinion about
such things as political issues and social questions. Inventories
have been employed, too, to survey systematically the association
between items in a constellation of attitudes or opinions—for
instance, between social and economic background factors,
preferences for political candidates, etc., and views about social
and economic issues. In sociological studies in which environ-
mental factors loom large, the kind of home from which a child
comes, the educational and occupational status of the parents,
and the character of the community may be revealed by a
systematic survey of background variables. Personality tests are
treated in Chapter 7.


How Mental Tests Are Used in the Schools 



As we have said, the primary function of the mental test is to
reveal individual differences. More specifically, mental tests are
useful to the teacher in three ways. First, mental tests aid in
the evaluation of class performance in relation to established
norms (p. 115). Second, tests reveal the strengths and weaknesses
of individual pupils, that is, are useful in educational diagnosis
(p. 116). Finally, tests enable the teacher to discover whether
a pupil possesses aptitude for a given subject or course of study,
and to predict his probable success in college or professional
school. We shall consider these three objectives in the chapters
to follow.


SUGGESTIONS FOR FURTHER READING 


Comprehensive accounts of the development of mental testing and of 
the application of tests in various areas will be found in the references 
below. 
Freeman, F. S. Theory and Practice of Psychological Testing. (Rev.
Ross, C. C., and Stanley, J. C. Measurement in Today's Schools. (3rd
Thorndike, R. L., and Hagen, Elizabeth. Measurement and Evaluation
in Psychology and Education. New York: Wiley, 1955.
CHAPTER 2


STATISTICS IN MENTAL TESTING 


The purpose of this chapter is to acquaint the prospective user 
of mental tests with those statistical terms and techniques most 
often used in testing. Stress throughout the chapter is on the 
meaning and significance of symbols and terms rather than on 
the mechanics of computation. For the latter, the student should 
consult the Appendix as well as the books on statistical method 
listed at the end of this chapter. 

Perhaps the best advice one can offer the teacher who is plan-
ning to use mental tests is that he first take a course in statistics.
For students who have been wise enough to do so, the present
treatment will constitute simply a brief review and summary.
And for those who have had no statistical training, it will pro-
vide the minimum essentials for the understanding and evalua-
tion of mental tests themselves.




THE FREQUENCY DISTRIBUTION 


Drawing Up a Frequency Distribution


Suppose that a teacher has administered a test of English 
grammar to fifty children in the seventh grade. The papers have 
been marked and the names and scores of the children recorded. 
Two questions ordinarily arise: (1) What is the typical per-
formance of the class, and (2) What is the range of talent in
the class? To answer these questions, we may organize and pre-
sent the fifty scores in one of several ways.

Table 2-1 is a systematic tabulation of the fifty English 
grammar scores into what is called a frequency distribution.

The fifty scores have been arranged from high to low into 
sets of five under the heading "Scores." In the frequency column 
headed “f” are listed the numbers of scores which fall into each
sub-group. For example, five children score in the interval 60-64, 
eight in the interval 55-59, and so on down to four who score 
in the bottom interval, 30-34. 
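
The tallying just described can be sketched in a few lines of Python. This is only an illustration and not part of the original treatment: the function name `frequency_distribution` and the short list of scores are hypothetical, chosen to show the mechanics. The interval width of five points and the lowest score limit of 30 follow Table 2-1, and the sketch assumes whole-number scores of 30 or more.

```python
from collections import Counter

def frequency_distribution(scores, lowest=30, width=5):
    """Tally integer scores into intervals of `width` points, highest first.

    Returns a list of ((low, high), f) pairs using score limits,
    for example ((60, 64), 5). Assumes every score is >= `lowest`.
    """
    # Map each score to the lower score limit of the interval it falls in.
    tally = Counter(lowest + ((s - lowest) // width) * width for s in scores)
    top = max(tally)
    rows = []
    for low in range(top, lowest - 1, -width):
        rows.append(((low, low + width - 1), tally.get(low, 0)))
    return rows

# Hypothetical scores for illustration only (not the fifty scores of Table 2-1).
sample = [31, 36, 38, 42, 44, 45, 47, 47, 49, 52, 58, 63]
for (low, high), f in frequency_distribution(sample):
    print(f"{low} - {high}  {f}")
```

Each printed row pairs an interval with its frequency, in the same high-to-low order used in the table.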


TABLE 2-1

Frequency Distribution of Fifty Scores
on a Test of English Grammar

Scores       f
60 - 64      5
55 - 59      8
50 - 54     10
45 - 49     12
40 - 44      6
35 - 39      5
30 - 34      4
         N = 50

A test score is always taken to represent the distance along
some scale of ability running from low to high. Thus, a score
of 46 covers the span from 45.5 to 46.5, 46.0 itself being the
middle of the score interval. Other scores have the same meaning:
in each case the score covers the distance .5 unit below to .5
unit above the face value of the given score. This definition of a
score means, of course, that the interval 30-34 begins at 29.5
and ends at 34.5, that interval 35-39 begins at 34.5 and ends at
39.5, and so on. For convenience in writing, the intervals in
Table 2-1 are the score limits rather than the exact limits. In each
case, however, the exact limits of the intervals are understood.
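
The relation between score limits and exact limits can be checked with a short Python sketch. The frequencies are those of Table 2-1; the names `TABLE_2_1`, `exact_limits`, and `midpoint` are hypothetical and introduced only for this illustration. The midpoint is included because it is the value above which each frequency is plotted in the next section.

```python
# Score limits and frequencies as given in Table 2-1.
TABLE_2_1 = [
    ((60, 64), 5),
    ((55, 59), 8),
    ((50, 54), 10),
    ((45, 49), 12),
    ((40, 44), 6),
    ((35, 39), 5),
    ((30, 34), 4),
]

def exact_limits(low, high):
    """Exact limits extend half a unit beyond the score limits."""
    return (low - 0.5, high + 0.5)

def midpoint(low, high):
    """Midpoint of the interval, halfway between its exact limits."""
    lower, upper = exact_limits(low, high)
    return (lower + upper) / 2

for (low, high), f in TABLE_2_1:
    lower, upper = exact_limits(low, high)
    print(f"{low}-{high}: exact limits {lower} to {upper}, "
          f"midpoint {midpoint(low, high)}, f = {f}")
```

Note that the upper exact limit of one interval is the lower exact limit of the next, so the intervals cover the scale without gaps.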


Graphic Representation of the Frequency Distribution 


A frequency distribution may be represented graphically by
a frequency polygon, as shown in Figure 2-1. In the construction
of a frequency polygon, scores are laid off along the baseline, or
X-axis, at equal intervals, and the frequencies (f's) are plotted
on the vertical or Y-axis. Each f is plotted directly above the
midpoint of the interval upon which it falls. The four scores
falling in the first grouping, 30-34, are plotted above 32, the
midpoint of the interval. In the other intervals (reading up), 5
scores are plotted above 37, midpoint of 35-39, 6 above 42,
12 above 47, and so on. The points are joined by straight lines
to complete the polygon.

FIGURE 2-1 Frequency Polygon of Fifty Scores Achieved by
Seventh-Grade Children on a Test of English Grammar

A frequency polygon shows graphically how the scores are
spread over the test scale from low to high. From Figure 2-1 it
is apparent that more children scored in the middle of the scale
[...] Appendix.
Another way of representing a frequency distribution graph-
ically is the histogram. Figure 2-2 represents the f's on the score
intervals by small rectangles set up over each interval. For the
first interval, the rectangle is four Y-units high, and for the
second interval five Y-units high, and so on. The highest rec-
tangle, 12 units on the Y-axis, is above interval 45-49.

FIGURE 2-2 Histogram of Fifty Scores Achieved by Seventh-
Grade Pupils on a Test of English Grammar
(X-axis: Scores; Y-axis: Frequencies)
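
A rough text version of such a histogram can be produced with the short Python sketch below. The frequencies come from Table 2-1; the function name `text_histogram` and the choice of `#` as the bar symbol are assumptions made for illustration. For convenience in printing, the bars here run horizontally rather than standing upright over the baseline as in Figure 2-2.

```python
# Frequencies from Table 2-1, listed from the lowest interval upward.
INTERVALS = ["30-34", "35-39", "40-44", "45-49", "50-54", "55-59", "60-64"]
FREQUENCIES = [4, 5, 6, 12, 10, 8, 5]

def text_histogram(labels, freqs, mark="#"):
    """Return one line per interval; the bar length equals the frequency."""
    return [f"{label} | {mark * f} ({f})" for label, f in zip(labels, freqs)]

for line in text_histogram(INTERVALS, FREQUENCIES):
    print(line)
```

The longest bar falls on the interval 45-49, matching the highest rectangle described in the text.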

The histogram and frequency polygon represent the same 
facts, and there is little to choose between them. Frequency 
polygons are to be preferred to histograms when two distribu-
tions are plotted on the same axes, since in the histogram the
vertical and horizontal lines often coincide, making the figures
difficult to disentangle.


The Normal Curve 


The symmetrical bell-shaped graph shown in Figure 2-3 is the
well-known normal curve. This “ideal” frequency polygon is
the mathematical model to which many distributions of actual
scores approximate. (See, for example, Figure 2-1.) The normal
curve is often called the normal probability curve because it
shows the probability of occurrence of scores of different size,
when these are determined by a large number of independent
and randomly combined factors.

FIGURE 2-3 The Normal Curve

The normal curve has played an important role in the develop- 
ment of mental measurement. Among its uses in testing may be 
mentioned the following: 


1. Selecting the Items of a Test. When the distribution of test
scores for a class is badly off-center or “skewed,” as shown in
Figures 2-4 and 2-5, the test is not suitable for the group. In
Figure 2-4 the test is too easy—there are too many high scores;
and in Figure 2-5 the test is too hard—there is a disproportionate
number of low scores. When the test maker takes the normal
curve as his model, questions and problems are carefully selected
and their scoring adjusted to give a symmetrical arrangement of
test scores like that of the normal curve. This means that a
majority of pupils score at the middle of the scale, a smaller
number scoring at the high and low ends. Note that, according
to the criterion of normality, the frequency polygon in Figure
2-1 shows the English grammar test to be generally satisfactory,
though perhaps a bit too easy.

FIGURE 2-4 Negatively Skewed Curve

FIGURE 2-5 Positively Skewed Curve


2. Scaling the Obtained Scores from a Test. Raw or obtained 
scores from a test are usually expressed by an arbitrary number 
of points. Scores of this sort do not represent equal steps or equal 
units along some ability scale; and since there is no zero point, 
a score of 40 is not twice as good as a score of 20. When point 
scores are transformed into deviations from the average or mean, 
and expressed in units of the standard deviation (page 36) of 
the group, they are called sigma-scores. The unit of deviation 
(the standard deviation) is usually represented by the Greek 
letter σ (sigma). Sigma-scores may later be converted into 
standard scores (page 38). Many educational achievement and 
aptitude tests publish norms (page 40) in terms of standard 
scores. These scores are comparable from test to test when 
distributions are normal, or approximately so. 

Point scores may be changed over directly into equal-unit 
scores in a normal distribution. Such “normalized” scores have 
several advantages (page 40). 


3. Determining the Stability of a Test Score. An obtained score 
on a test—for example, a group test IQ—can be expected to 
vary somewhat up or down when the test is administered a 
second time. The variation to be expected in a score, that is, its 
probable stability, can be predicted from tables of the normal 
probability curve (page 23). 


AVERAGES 


After a frequency distribution has been tabulated, we are 
ready to compute a typical measure or average. There are three 
sorts of averages—also called measures of central tendency—in 
common use. 


The Mean (M) 


Given a set of ten scores, 10, 9, 10, 12, 8, 6, 4, 7, 5, and 4, the 
mean is simply 7.5—found by adding the scores (75) and dividing 
this sum by their number (10). The M is popularly called the 
average. When scores have been grouped into a frequency 
distribution, as shown in Table 2-1 on page 14, a slightly different 
method is employed in finding the M. (See Appendix.) But the 
M is always essentially the sum of the scores divided by their 
number. 


The Median (Mdn) 


When scores are arranged in order of size, another sort of 
average, the median (Mdn), is the point in the distribution found 
by counting off one-half of the scores from either end of the 
series. We usually start with the low end. For example, for the 
five scores 7, 8, 9, 10, and 12, the median or mid-score is 9: there 
are two scores above and two below it. When the number of 
scores is even—for example, 5, 7, 8, 9, 10, and 12—the median 
is midway between the two middlemost scores, namely, at 8.5. 
There is no mid-score. When scores are grouped into a frequency 
distribution, as shown in Table 2-1 on page 14, the median is 
still the 50 per cent point—the point found by counting 50 per 
cent of the way into the distribution. For a method of computing 
the median, see the Appendix. 


The Mode 


That score in a set of scores which occurs most frequently is 
called the crude mode, or the modal score. The crude mode is a 
third sort of average. In Table 2-1 the crude mode is taken at 47, 
midpoint of the interval which contains the largest frequency. 
The mode can be computed more exactly, but usually we simply 
take the most often recurring score as the crude mode without 
further refinement. In most cases, the mode is a preliminary 
measure of central tendency. For exploratory purposes it does 
not need to be computed so precisely as the mean or median. 
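The three averages can be checked with a few lines of Python. This is only a sketch using the small sets of scores quoted in the text; the variable names are illustrative, and the functions come from Python's standard `statistics` module.

```python
from statistics import mean, median, multimode

# The ten scores used in the text's example of the mean.
ten_scores = [10, 9, 10, 12, 8, 6, 4, 7, 5, 4]
m = mean(ten_scores)             # sum of scores (75) divided by their number (10)

# The two small sets used in the text's example of the median.
odd_set = [7, 8, 9, 10, 12]
even_set = [5, 7, 8, 9, 10, 12]
mdn_odd = median(odd_set)        # the mid-score
mdn_even = median(even_set)      # midway between the two middlemost scores

# Crude mode: the most often recurring score (there may be more than one).
crude_modes = multimode(ten_scores)

print(m, mdn_odd, mdn_even, crude_modes)
```

Note that in the set of ten scores both 10 and 4 occur twice, so the crude mode is not unique there. This is one reason the mode is treated only as a preliminary measure.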


MEASURES OF VARIABILITY 


The Range 


It is sometimes more important to know the variability of a 
set of scores than to know the mean or median. Suppose, for 
example, that two sections of Grade 7 have the same mean but 
differ markedly in spread of talent, as evidenced in the variability 
of scores around the mean. Figure 2-6 shows two distributions 
of this sort: the scores in A range from 40 to 60, whereas the 
scores in B range from 20 to 80. The difference between the high 
and low scores in the A distribution is 20 points; in the B 
distribution 60 points. The range is the most general index of variability. 
Other more exact measures are the standard deviation (written 
as SD or σ) and the quartile deviation (written as Q). 


FIGURE 2-6 Two Distributions with the Same Mean but 
Differing Markedly in Range (Variability) 


The Standard Deviation (σ) 


The mean of the set of five scores—12, 10, 8, 6, and 4—is 8. 
If 8 is subtracted from each score, we have 12 − 8 = 4, 10 − 8 = 2, 
8 − 8 = 0, 6 − 8 = −2, and 4 − 8 = −4. The size of a 
deviation from M tells the extent to which the individual score 
deviates from the common mean; and the sign of the deviation 
indicates its direction from M. If each deviation is now squared, 
we have 4² = 16, 2² = 4, (−2)² = 4, and (−4)² = 16. The 
square of 0 is, of course, 0. The sum of these squared deviations 
is 40, and σ, the standard deviation, is defined as 

$$\sigma = \sqrt{\frac{\sum x^{2}}{N}}$$

in which x is the deviation of a score from M and N is the number 
of scores; or, in our example, σ = √(40/5) = √8 = 2.83.* 


Squaring the separate deviations around the M eliminates the 
minus signs and gives extra weight to extreme deviations. An SD 
or σ is judged to be large or small (to reflect much or little 
variation) in relation to other SD's computed for the same test. For 
example, if 35 boys and 42 girls have the same M on a history test 
but the boys' σ is 10 and the girls' σ is 6, we know that the boys' 
scores spread more than the girls' up and down the scale—in 
both directions from the mean. 
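The worked example for the five scores can be retraced step by step. The sketch below follows the text's definition (squared deviations summed and divided by N, not by N − 1); the variable names are illustrative only.

```python
from math import sqrt

scores = [12, 10, 8, 6, 4]
n = len(scores)

m = sum(scores) / n                          # mean of the five scores
deviations = [x - m for x in scores]         # 4, 2, 0, -2, -4
sum_sq = sum(d * d for d in deviations)      # sum of squared deviations
sd = sqrt(sum_sq / n)                        # standard deviation as defined in the text

print(m, sum_sq, round(sd, 2))
```

The same figure is returned by `statistics.pstdev(scores)`, which also divides by N.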

In a normal curve, σ provides valuable information 
concerning the way in which the separate measures fall around the 
common mean. In Figure 2-7, for example, +3σ is seen to include 
virtually all the measures above the M, and −3σ all of the 
measures below the M. The total area of the normal curve is taken 
as N. From tables of the area of the normal curve, we know that 
between M and +1σ are approximately 34 per cent of the measures 
(actually 34.13 per cent); and between M and −1σ are also 
34 per cent of the measures. The two "halves" of the curve are 
equal. Hence we find about 68 per cent of the measures—roughly 
two thirds—between −1σ and +1σ. Furthermore, from tables we 
find that 14 per cent of the measures fall between 1σ and 2σ in 
the normal curve and about 2 per cent between 2σ and 3σ. The 
same proportions hold, of course, for the half of the curve to 
the left of the M, since the M divides the area of the normal 
curve into two equal parts. 

* For calculation of σ from a frequency distribution, see Appendix. 


FIGURE 2-7 Areas Under the Normal Curve 

The relations of σ to the total area (N) in the normal curve 
model hold approximately for distributions which resemble the 
normal curve in form. An illustration will make clear how the 
normal curve model is used in such cases. Suppose that on a 
reading test administered to sixty children in the fifth grade, 
the M = 62 and σ = 8; and suppose further that the frequency 
polygon of these scores closely resembles the normal curve in 
form. Taking the normal curve as our model, then, we can say 
that approximately two thirds of the scores (that is, forty) fall 
between 54 and 70 (62 ± 8). Moreover, about 14 per cent of 
the scores, or about 8, will fall between 70 and 78 (between 
1σ and 2σ), and about 2 per cent, or 1 or 2, will fall between 
78 and 86 (that is, between 2σ and 3σ). In the lower half of 
the distribution, 14 per cent, or 8 scores, will fall between 54 
and 46, and 2 per cent between 46 and 38. These relationships 
are shown in Figure 2-8. Note that M, the reference point, 
is 62 and that σ is 8 units on the test scale. 


FIGURE 2-8 Use of Normal Curve Model to Show Distribution 
of Sixty Scores on a Reading Test 
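The rounded percentages in the reading-test illustration can be compared with exact normal-curve areas. The following is a sketch only: it assumes a perfectly normal model with M = 62 and σ = 8, uses `NormalDist` from Python's standard library in place of printed tables, and the function name `expected_count` is illustrative.

```python
from statistics import NormalDist

model = NormalDist(mu=62, sigma=8)   # normal curve model for the reading test
n_pupils = 60

def expected_count(low, high):
    """Expected number of the sixty scores falling between low and high."""
    return n_pupils * (model.cdf(high) - model.cdf(low))

# Score bands of the illustration: within 1 sigma, then 1 to 2 and 2 to 3 sigma
# above and below the mean.
for low, high in [(54, 70), (70, 78), (78, 86), (46, 54), (38, 46)]:
    print(f"{low}-{high}: about {expected_count(low, high):.1f} scores")
```

The exact areas give counts close to the round figures used in the text (roughly forty, eight, and one or two scores).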


The Quartile Deviation (Q) 


Just as we compute the median by counting off 50 per cent 
of the scores, so we can count off 25 per cent of the scores 
from the low end of the distribution (that is, 25 per cent of 
N) to locate Q1, the first quartile point. Similarly, we can count 
off 75 per cent of the scores from the low end of the distribution 
to locate Q3, the third quartile. The gap between Q3 and Q1 
is called the interquartile range, or range of the middle 50 per 
cent. Q, the semi-interquartile range, is computed thus: 

$$Q = \frac{Q_3 - Q_1}{2}$$

Like σ, Q is a measure of variability but, unlike σ, it is found 
by counting into the distribution, whereas σ is computed from 
the squared deviations taken around the M. When the median 
is the measure of central tendency, we generally use Q; when 
M is the measure of central tendency, we use σ. 

Methods of computing Q from a frequency distribution will 
be found in the Appendix; here we are concerned primarily 
with the meaning of Q as a measure of variability. Q's 
usefulness will become clearer when we have computed the percentile 
curve, or ogive, as shown in the next section. 


PERCENTILES AND PERCENTILE RANK 


Table 2-2 shows the frequency distribution of Table 2-1, 
with the addition of two columns in which the f's have been 
cumulated. 


TABLE 2-2 


Frequency Distribution and Cumulated Frequencies of 
Fifty Scores on an English Grammar Test 
Data are the fifty scores in Table 2-1. 


  (1)        (2)     (3)      (4) 
Scores        f     cum.f   % cum.f 
60 - 64              50       100 
55 - 59              45        90 

In column (3), scores have been added progressively (cumulated) 
from the bottom to the top of the distribution. On the 
first interval, 4 is the entry; 4 + 5 on the next interval gives 9; 
9 + 6 on the third interval gives 15; and so on. In column (4), 
these cumulated scores are expressed as percentages of N. In 
Figure 2-9, cumulated f's in percentages have been plotted 
against the score-intervals laid off along the baseline. As scores 
are added over each interval, each cum. f is plotted just above 
the upper limit of the interval upon which it falls. The resulting 
S-shaped curve is called an ogive, or cumulative frequency 
graph. The ogive constricts or expands the scale of scores into 
a scale of one hundred points, called a percentile scale. The 
median and the Q's can be read from the ogive almost as 
accurately as they can be computed from a frequency distribution. 
To illustrate, if a line is run from the 50 per cent point on the 
Y-scale across to the curve, a perpendicular dropped from this 
point to the score-scale locates Mdn at 49 approximately. (The 
computed value is 48.66.) The twenty-fifth percentile, or Q1, 
is located from the ogive at 42 approximately; and the 
seventy-fifth percentile, or Q3, at about 55. Other percentile points (for 
example, Pss or Pos) can be located in the same manner by going 
from the appropriate point on the vertical percentage scale 
across to the ogive and dropping a perpendicular to the 
baseline. Note that the distance from Q3 to Q1 (that is, 55 − 42) is 
the interquartile range or range of the “middle 50.” One-half 
of this distance is 13/2, or 6.5, which is the quartile deviation, 
or Q. The larger the Q of a distribution, the greater the spread 
of the middle 50 per cent of scores along the scale and the larger 
the variability. 


FIGURE 2-9 Ogive or Cumulative Frequency Curve 
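The cumulation of frequencies and the computation of Q from the quartile points can be sketched as follows. Only figures stated in the text are used: the frequencies 4, 5, and 6 on the three lowest intervals, N = 50, and the quartile points 42 and 55 read from the ogive. The variable names are illustrative.

```python
from itertools import accumulate

n_scores = 50
first_three_f = [4, 5, 6]     # frequencies on the three lowest intervals

# Column (3): frequencies added progressively from the bottom up.
cum_f = list(accumulate(first_three_f))
# Column (4): cumulated frequencies expressed as percentages of N.
pct_cum_f = [100 * c / n_scores for c in cum_f]

q1, q3 = 42, 55               # quartile points as read from the ogive
interquartile_range = q3 - q1
q = interquartile_range / 2   # quartile deviation (semi-interquartile range)

print(cum_f, pct_cum_f, interquartile_range, q)
```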

A pupil's percentile rank (PR) is the position on the 
percentile scale (on a scale of one hundred points) to which his 
score entitles him. Suppose that Tom Brown achieves a score 
of 40 on our English grammar test. What is his PR? Going out 
to 40 on the score-scale on the baseline, up to the curve, and then 
across to the Y-scale, we locate Tom's PR at about 20. This 
PR tells us at once that about 20 per cent of the pupils scored 
lower than Tom. If Mary Green scores 58 on the grammar test, 
her PR is read at approximately 84—and 84 per cent of the 
class made lower scores than she did. Scores achieved on tests 
expressed in different units—for example, a reading test and an 
arithmetic test—cannot be compared directly. But relative 
positions (PR's) of a child in his classes can be quickly determined 
and compared when both sets of scores have been converted 
into a common percentile scale. Moreover, several PR's may be 
combined to give a general index. 
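When the individual scores are at hand, a PR can also be counted directly instead of being read from the ogive. The sketch below takes the PR simply as the percentage of scores lower than the given score. The two short score lists are hypothetical, invented only to show how scores in different units become comparable; they are not data from the text.

```python
def percentile_rank(score, all_scores):
    """Percentage of the scores in all_scores that are lower than score."""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)

# Hypothetical class of ten pupils (invented numbers, for illustration only).
reading = [31, 35, 38, 40, 42, 45, 47, 50, 53, 58]
arithmetic = [12, 14, 15, 17, 18, 20, 21, 23, 24, 27]

# One pupil's raw scores are in different units, but the PR's share one scale.
pr_reading = percentile_rank(45, reading)
pr_arithmetic = percentile_rank(24, arithmetic)

print(pr_reading, pr_arithmetic)
```

In this invented case the pupil stands higher in arithmetic (PR 80) than in reading (PR 50), a comparison the raw scores of 24 and 45 could not support.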


CORRELATION* 


The relationship between two sets of test scores can be 
described mathematically by the coefficient of correlation between 
them. Correlation is expressed by a decimal fraction (called r), 
which may vary along a scale from .00 to ±1.00. Let us suppose 
that tests in English grammar and in history have been 
administered to the same seventh-grade class. Suppose further that 
children who score high in the English test tend to score high 
in history, and that children scoring fairly high or quite low 
in English tend to score fairly high or quite low in history. When 
this happens, the coefficient of correlation between the two 
sets of scores will be marked or substantial, for example, .60 
to .70. Now suppose that most pupils who score high in English 
grammar score only average in arithmetic. The correlation 
between these two areas would then be lower—perhaps no more 
than from .20 to .30. If those pupils who score high in English 
grammar tend to earn very low scores on a test in shop work, 
the correlation here would be close to zero, or perhaps negative. 

* See Appendix for computation of a correlation coefficient. 

Positive coefficients of correlation run from .00 to +1.00; 
good scores in the one test go with good scores in the other. 
Negative coefficients of correlation run from .00 to −1.00; they 
denote inverse relationship, and good scores in the first test go 
with poor scores in the second. Zero correlation denotes just 
no correlation between two variables. 

Whether a correlation coefficient is to be regarded as high or 
low depends upon a number of factors. The correlation of 
height with weight in school children is generally high—around 
.70 for a given age level. The correlation of a good intelligence 
test and school grades will fall typically between .50 and .70; 
and the correlations between personality traits (from 
questionnaires) and school achievement are usually low and often 
negative. The following table will aid in interpreting coefficients of 
correlation: 


r's from  .00 to ±.20     very low; negligible 
r's from ±.20 to ±.40     low; present but slight 
r's from ±.40 to ±.70     substantial or marked 
r's from ±.70 to ±1.00    high to very high 


When computing the correlation between two forms of the 
same test (the self-correlation of the test), we demand much 
higher r's than are found typically between different variables. 

The Reliability of a Test 


The Reliability Coefficient. One important application of 
correlation to mental testing is in the determination of the reliability 
of a test. Test reliability refers to the stability of test scores. 
If a child achieves a score of 48, for example, on a highly reliable 
test of general science, subsequent scores earned by this pupil 
upon equivalent forms of this test should not differ greatly from 
the initial score of 48. But if the test is unreliable, in repeated 
testing the score may vary widely from its first determination. 

The reliability coefficient of a test is found by computing the 
self-correlation of the test. Suppose that a reading examination 
has been given to five sixth-grade classes and that two weeks 
later the same test or an equivalent form is administered to the 
same classes. If the correlation between these two 
administrations of the test is high (a reliability coefficient of .90 or more 
is considered high), we may feel confident that scores earned 
by pupils in these classes are reasonably accurate measures of "true" 
ability. 

Test reliability is sometimes determined by repeating a test 
and correlating the second set of scores against the first set. This 
method is followed when there is only one form of the test. 
More often, an equivalent or parallel form of the test is given, 
and the reliability coefficient is the r between the test and its 
alternate form. The reliability coefficients of many standard 
educational tests have been determined in this way. Other ways of 
determining test reliability will be found in the references at the 
end of the chapter. The authors of standard tests will usually 
specify what method has been used in computing the reliability 
of their tests. 


The Standard Error of a Score 


The accuracy or precision of an individual score is perhaps 
best expressed by the standard error of a score, which is also 
called the standard error of measurement. The SE (standard 
error) is calculated from the following formula: 

$$SE_{(score)} = \sigma\sqrt{1 - r_{11}}$$

where σ is the standard deviation of the test scores and r11 is 
the reliability coefficient of the test. Suppose that the σ of a 
set of test scores is 10 and the reliability coefficient (r11) is .95. 
Then the SE of a score on this test is SE = 10√(1 − .95), or 2.2. 
This may be interpreted to mean that should a child take this 
test a second time, the chances are good (about 7 in 10) that 
his “new” score will not diverge by more than ± 2 points from 
the true determination. The SE of a Stanford-Binet IQ is 4.5 
points for IQ's from 90 to 110. In other words, if the test is 
repeated, we can expect a child's IQ to stay within 4 to 5 points 
of its true value. 
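The arithmetic of the example is easily verified. In this sketch the function name `standard_error_of_score` is illustrative; the formula is the one given above, the SD multiplied by the square root of one minus the reliability coefficient.

```python
from math import sqrt

def standard_error_of_score(sd, reliability):
    """SE of a score: SD of the test times sqrt(1 - reliability coefficient)."""
    return sd * sqrt(1 - reliability)

# Example from the text: sigma = 10, reliability coefficient = .95.
se = standard_error_of_score(10, 0.95)
print(round(se, 1))
```

Because the SD enters the formula directly, two tests with the same reliability coefficient but different spreads of scores will have different standard errors.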

Reliability coefficients of standard intelligence and 
educational achievement tests are generally above .90 for large groups 
of pupils. The size of the reliability coefficient depends upon 
several factors: the variability of the group, the length of the 
test, the method used in determining reliability. A reliability 
coefficient of .50 in a single grade or class may indicate as much 
stability of score as a reliability coefficient of .90 in a large 
group. The great advantage of the SE of a score is that it takes 
account of both the reliability coefficient and the variability (SD) 
in the group. (See page 56.) 


The Validity of a Test 


A mental test is a valid testing device when it measures what 
it claims to measure. Tests are not valid for all areas and all 
situations, but are valid in certain defined situations and for 
certain behaviors. A group intelligence test, for example, is not 
a valid measure of emotional control or of delinquent behavior. 
Validity may be classified, for convenience, into three sorts: 
experimental, content, and predictive. The validity of an 
intelligence test is determined experimentally by computing the test's 
correlation with various criteria: school grades, ratings for mental 
alertness, and other measures of intellect, to mention a few. 
Many of the best tests of general intelligence have been 
validated against the Stanford-Binet, the best known individual 
intelligence test (page 47). Aptitude tests—for example, those 
of clerical and mechanical aptitudes—are validated against 
demonstrated proficiency in office work or in mechanical tasks. 
Independent measures against which tests are validated are called 
criteria. Criteria do not represent entirely adequate or 
completely sufficient determinations of a trait. Insofar as criteria 
incorporate valuable aspects of the behavior we are studying, 
however, they represent variables with which a test, to be valid, 
must correlate positively. 

Tests of educational achievement in history, mathematics, 
languages, and the like possess content validity in that test 
questions sample the subject matter areas directly. Content 
validity is not alone a sufficient index of a test's usefulness. Such 
considerations as choice of items, extent of sampling, form in 
which items are put, and level of difficulty are also very 
important. But content validity is a necessary first step. Intelligence 
and aptitude tests possess content validity insofar as the items 
in them fulfill the author's definition of what he is measuring. 
Such asserted or “face” validity, however, is never as convincing 
as is the content validity of the educational achievement test. 
Generally, tests of intelligence and of aptitude must depend for 
their validity upon correlations with independent criteria judged 
to be dependable indices of the trait under study. 

Predictive validity is the degree to which a test battery is related 
to some criterion of future performance or measure of success 
which will become available in the future. The predictive validity 
of a good group intelligence test for school performance ranges 
from about .40 to .60. (See page 96.) Many short tests have low 
correlations with a criterion, but when put with other tests 
into a team combine forces to raise the correlation of the battery 
with the criterion. Validity coefficients do not run as high as do 
reliability coefficients, since no test can correlate higher with 
other tests than with measures of itself. 

Personality questionnaires, interest blanks, and attitude scales 
have content validity insofar as choice of items is concerned. 
Such instruments are usually validated experimentally against 
objective expressions of interest, indices of neurotic behavior, 
and the like. 


Practical Considerations in the Choice of a Test 


There are a number of factors which enter into the choice 
of a mental test besides validity and reliability. Some of the 
more important are the following: 


1. Appearance: Is the test format good—are the items 
attractively presented and arranged? 

2. Administration: How much time is required to give and 
score the test? What is the cost? 

3. Manual: Does the author give full accounts of reliability and 
validity—how found, upon what samples, of what sorts? 
Are instructions clear? 

4. Norms: Are the test norms readily interpreted? Are age and 
grade equivalents given? What type of scaling is used? 


SCALING TEST SCORES 


The purpose of scaling is (1) to revamp the raw test scores 
into a scale of equal units, and (2) to enable us to combine 
sub-tests into a single index. It is sometimes important (especially 
with aptitude tests) to compare relative performances, and this 
can be done only when tests are expressed in equal units. The 
score on a test when expressed simply in number of items done 
correctly is an aggregate of arbitrary points. Pupils can be 
ranked in order of merit for such aggregates, but such “scores” 
do not constitute a scale. There are several methods for scaling 
raw scores. 


The Age Scale 


When scores are put into MA (mental age) units, they form an 
age scale. Mental age is the chronological age which corresponds 
to or is typical of a given test score. The MA of 9-6, for example, 
represents the performance of the average child who is 9 years 
and 6 months old. Thus if Dick achieves an MA of 9-6 on the 
Stanford-Binet, this mental age is a measure of his intellectual 
status or degree of mental growth. 

If Dick's life age (CA) is 10-4, his IQ is 92. IQ = MA/CA, 
and in our example is 114 mos/124 mos, or .92; the decimal point 
is dropped. IQ is a measure of a child's brightness relative to that of other 
children of his age. When MA and CA are equal, the IQ is 100, 
the brightness index of the average child. Dick's IQ of 92 means 
that he is somewhat less bright than the typical child of his age 
level. IQ's above 100 are achieved by bright children—those 
whose mental growth runs ahead of their years. IQ's below 100 
indicate that a child is below normal, and very low IQ's (70 or 
below) imply feeblemindedness. 
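Dick's IQ can be recomputed in a line or two. The sketch below converts both ages to months, takes the ratio, and multiplies by 100, which amounts to dropping the decimal point. The function name `ratio_iq` is illustrative, and rounding to the nearest whole number is an assumption about how the quotient is reported.

```python
def ratio_iq(ma_months, ca_months):
    """Ratio IQ: 100 * MA / CA, reported as a whole number."""
    return round(100 * ma_months / ca_months)

dick_ma = 9 * 12 + 6     # MA of 9-6 expressed in months
dick_ca = 10 * 12 + 4    # CA of 10-4 expressed in months
dick_iq = ratio_iq(dick_ma, dick_ca)

print(dick_ma, dick_ca, dick_iq)
```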

The age-scale is used in most individual intelligence scales and 
by many group tests of general intelligence. The MA and IQ 
were first widely used to measure performance on the Stanford- 
Binet test, which was constructed so as to meet the requirements 
necessary to yield a constant ratio score, or IQ. Many group 
tests do not meet these requirements. It is wise, therefore, to 
accept IQ's from group tests as tentative indices of brightness 
not always closely related to IQ from the Stanford-Binet. 


The Percentile Scale 

We have already seen (page 25) how obtained scores can be 
fitted into a scale of one hundred units to yield a percentile 
scale. The PR (percentile rank) of a score—its position on the 
percentile scale—can be computed from the frequency 
distribution of scores. But the simplest plan is to plot an ogive (see Figure 
2-9, page 26) and read the PR from the graph. The PR of 
any score then becomes the percentage of the distribution which 
lies below the score. This method is not accurate beyond the 
first decimal, but it is sufficiently precise for many purposes. It 
is easy to apply and requires a minimum of calculation. Table 2-3 
gives the frequency distribution of 180 scores on a clerical 
aptitude test earned by students enrolled in several courses in a 
business college. 


TABLE 2-3 


Frequency Distribution of 180 Scores Achieved on a 
Clerical Aptitude Test 


                                      PR's of 
Scores       Midpoints    % cum.f    midpoints 
194 - 196       195 
191 - 193       192 
188 - 190       189 
185 - 187       186 
182 - 184       183 
179 - 181       180 
176 - 178       177 
173 - 175       174 
170 - 172       171 


The ogive in Figure 2-10 has been constructed from the 
% cum. f's in Table 2-3 following the method outlined on page 
26. In the last column are entered the PR's of the midpoints of the 
successive score-intervals. The midpoints reading down are 195, 
192, 189, 186, 183, 180, 177, 174, and 171. Any student who 
earns a score of 191, 192, or 193—who falls, that is, in the interval 
next to the top—receives a PR of 95, the PR of the midpoint of 
this interval. These midpoint PR's constitute norms for the test. 
PR's can be read with considerable accuracy from the ogive. 


FIGURE 2-10 Cumulative Frequency Polygon (Ogive) of 180 
Scores Achieved on a Clerical Aptitude Test 


PR's have several advantages over raw scores. Suppose that 
a child has taken tests in arithmetic, science, English, history, and 
spelling. If his PR's in each of these tests are known, they can 
be represented comparatively on a profile as shown in Figure 
3:13. 

This graph permits a comparison of the child's achievement in 
the five subjects. It is clear that he is satisfactory in arithmetic 
(PR = 60) and science (PR = 55), above average in history 
(PR = 60), average in English (PR = 50), and below 
average in spelling (PR = 45). Comparisons of this sort cannot be 
made from raw scores. One disadvantage of the PR scale is the 
fact that units are not equal at the extremes of the scale. When 
PR's are below 20 or above 80, they must be compared (or 
combined) with caution (see refs.). 


Sigma-scores and Standard Scores 


We have seen that one way of converting raw scores into 
a scale is by means of percentile ranks. Another method of 
scaling is to express the deviation of each test score from the 
common mean in units of SD, thus putting all scores into σ-units. 
Such “deviation” scores are called σ-scores and sometimes 
z-scores. The following is an illustration of the method of 
converting obtained scores into σ-scores. 

Table 2-4 gives the M's and σ's earned by fifty sixth-grade 
pupils on five objective educational achievement tests. At the 
bottom of the table are listed the scores achieved by two children, 
Mary and Howard. 


TABLE 2-4 


M's and σ's Earned on Five Objective Tests of Educational 
Achievement Given in the Sixth Grade 


                  (1) Arith.  (2) Arith. 
                  Reas.       Comp.       (3) Reading  (4) Grammar  (5) Science 
Mean              62          124         43           28           46 
σ                 10          20          7            4            8 
Mary's scores     57          119         50           31           36 
Howard's scores   62          144         41           26           49 


From an inspection of these scores, it is clear that Mary is 
below the class mean in arithmetical reasoning, arithmetical 
computation, and science, but is above the mean in reading and 
grammar. Howard, on the other hand, is exactly on the mean in 
arithmetical reasoning, above the mean in computation and science, 
and slightly below the mean in reading and grammar. These 
comparisons are useful, but because of differences in the units in which 
test scores are expressed, we cannot (1) compare Mary's and 
Howard's scores in the several tests, except to point out that 
they are above or below the mean, nor (2) combine either pupil's 
scores into a single meaningful index of academic achievement. 
Conversion of test scores into σ-units will permit us to carry 
out both these operations. 

The formula for a σ-score or z-score is 

$$z = \frac{X - M}{\sigma} = \frac{x}{\sigma}$$

where (X − M) = x. Mary earned a score of 57 in arithmetical 
reasoning. This score deviates −5 points from the mean 
(57 − 62 = −5). If we divide this deviation of −5 by 10 (the 
σ), we have −.50 as Mary's σ-score in arithmetical reasoning. 
In Test 2, arithmetical computation, Mary's σ-score is (119 − 
124)/20 or −.25. Her other σ-scores are computed in the same 
way; those that are plus are above the mean, those minus below 
the mean. Mary's five σ-scores are shown below: 


Test:                  (1)     (2)     (3)     (4)     (5) 
Mary's σ-scores       −.50    −.25    1.00     .75   −1.25 


Howard's σ-scores are found as were Mary's. In Test 1, his 
score of 62 is exactly on the mean, and his σ-score is .00. In 
Test 2, arithmetical computation, Howard's σ-score is (144 − 
124)/20 or 1.00. Howard's scores are below the mean in Tests 
3 and 4, and his σ-scores are minus. His scores are tabulated 
below: 

Test:                  (1)     (2)     (3)     (4)     (5) 
Howard's σ-scores      .00    1.00    −.29    −.50     .38 
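The conversion of both pupils' raw scores into σ-scores can be sketched in a few lines. The means, σ's, and raw scores are those of Table 2-4; the function name `sigma_scores` and the variable names are illustrative.

```python
# Means, sigmas, and raw scores for the five tests, as given in Table 2-4.
means = [62, 124, 43, 28, 46]
sds = [10, 20, 7, 4, 8]
mary = [57, 119, 50, 31, 36]
howard = [62, 144, 41, 26, 49]

def sigma_scores(raw, means, sds):
    """Deviation of each raw score from its test mean, in sigma units."""
    return [(x - m) / sd for x, m, sd in zip(raw, means, sds)]

mary_z = sigma_scores(mary, means, sds)
howard_z = sigma_scores(howard, means, sds)

print([f"{z:+.2f}" for z in mary_z])
print([f"{z:+.2f}" for z in howard_z])
```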


It is apparent that σ-scores are simply plus or minus deviations 
from the test mean expressed in σ units. A practical disadvantage 
of such scores is the fact that they are small decimal fractions 
and are about as often + as −. For greater convenience, 
therefore, σ-scores are usually converted into a distribution of 
standard scores with an assigned M and σ. M's and σ's often 
selected are M = 100, σ = 20; M = 500, σ = 100; M = 10, 
σ = ... 

If Mary's and Howard's scores are converted into a standard 
score distribution with M = 100 and σ = 20, we have the 
following: 

Tests                      (1)   (2)   (3)   (4)   (5)   Total   Mean 
Mary's standard scores      90    95   120   115    75    495     99 
Howard's standard scores   100   120    94    90   108    512    102 


In the first test, Mary's σ-score is −.50, or .50 of σ below the
M. In our new distribution (M = 100, σ = 20), the equivalent
standard score is one-half of σ below 100, or 90. In Test 2,
Mary's standard score is ¼ of σ below the mean of 100, or 95
(¼ of 20 is 5). A formula for converting obtained scores directly
into standard scores with an M = 100 and a σ = 20 is the
following:

$$X' = \frac{20}{\sigma}(X - M) + 100$$

in which X' = standard score in the "new" distribution
         X = original or raw score
         M = mean of the raw score distribution
         100 and 20 are the M and σ of the new distribution
         σ = SD of the original or raw scores


Substituting for Mary's raw score of 50 in Test 3, we have

$$X' = \frac{20}{7}(50 - 43) + 100 = 120$$

Howard's standard scores in the new distribution are found
from the same formula. In Test 4, for example, Howard's
obtained score is 26 and from the formula we have

$$X' = \frac{20}{4}(26 - 28) + 100 = -10 + 100 = 90$$

The formula will convert any pupil's raw scores into standard
scores when the M of the standard score distribution is 100 and
σ is 20.

When put in standard score form, Mary's and Howard's scores
can be compared directly; and the five scores of each child can be 
combined with equal weights. On the five tests, Mary's average 
is 99 and Howard's 102. 
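The conversion and the averaging can be sketched together. The block below is only an illustration: it takes Howard's five σ-scores as tabulated earlier, places them on the M = 100, σ = 20 scale, rounds each to a whole number as the table does, and then averages them. The helper names are assumptions made for the example.

```python
NEW_MEAN = 100
NEW_SD = 20


def standard_from_sigma(z):
    """Place a sigma-score on the new scale: NEW_MEAN plus z times NEW_SD."""
    return NEW_MEAN + NEW_SD * z


def standard_from_raw(raw, mean, sd):
    """Apply X' = (20 / sd) * (X - M) + 100 directly to a raw score."""
    return NEW_SD * (raw - mean) / sd + NEW_MEAN


# Howard's sigma-scores on Tests 1 to 5, as tabulated in the text.
HOWARD_SIGMA = [0.00, 1.00, -0.29, -0.50, 0.38]

howard_standard = [round(standard_from_sigma(z)) for z in HOWARD_SIGMA]
howard_total = sum(howard_standard)
howard_mean = howard_total / len(howard_standard)

if __name__ == "__main__":
    print("Howard's standard scores:", howard_standard)
    print("Total:", howard_total, "Mean:", howard_mean)
    # The two worked substitutions from the text.
    print("Mary, Test 3:", standard_from_raw(50, 43, 7))
    print("Howard, Test 4:", standard_from_raw(26, 28, 4))
```

The text reports Howard's average as the whole number 102; the unrounded mean of his five standard scores is slightly higher.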


The IQ as a Standard Score

When raw scores are converted into standard scores in a
distribution with a mean of 100 and a σ of 15, these new scores
are often called "deviation IQ's" (page 36). In the Wechsler-
Bellevue Intelligence Test, for example, IQ's are determined by
this method. A Wechsler-Bellevue IQ of 115 is 1σ above the
mean of the group; an IQ of 85 is 1σ below the mean of
the group.

A general formula for transforming obtained scores into
standard scores with any given mean and σ is

$$X' = \frac{\sigma'}{\sigma}(X - M) + M'$$

where
X' = standard score in new distribution
X = obtained score (usually in points)
σ' = SD of the new distribution
σ = SD of obtained score distribution
M' = M of standard score distribution
M = M of raw score distribution


This formula may be used to compute deviation IQ's. Suppose
that Arthur J., a veteran 32 years old, earns a score of 86 on an
intelligence test for which the mean of his age-group is 80 and
the σ is 10. What is Arthur J.'s deviation IQ? Substituting in the
formula, we have

$$X' = \frac{15}{10}(86 - 80) + 100 = 109$$

The formula is useful when we wish to convert the sub-tests of
a battery into comparable units which may be combined into a
single score.
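As a minimal sketch, the general formula can be written as one function and applied to Arthur J.'s case. The parameter names are illustrative; the default target scale (mean 100, σ 15) is the deviation-IQ scale described in this section.

```python
def rescale(raw, mean, sd, new_mean=100, new_sd=15):
    """General standard-score formula: X' = (new_sd / sd) * (X - M) + M'."""
    if sd <= 0 or new_sd <= 0:
        raise ValueError("standard deviations must be positive")
    return new_sd * (raw - mean) / sd + new_mean


# Arthur J.: score 86, age-group mean 80, sigma 10.
arthur_iq = rescale(86, mean=80, sd=10)

# A raw score exactly one sigma above or below its own mean.
one_sigma_above = rescale(90, mean=80, sd=10)
one_sigma_below = rescale(70, mean=80, sd=10)

if __name__ == "__main__":
    print("Arthur J.'s deviation IQ:", arthur_iq)
    print("One sigma above and below the mean:", one_sigma_above, one_sigma_below)
```

Passing a different `new_mean` and `new_sd` gives any of the other standard-score scales, which is what makes sub-test scores comparable before they are combined.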


Normalized Standard Scores, or T-Scores 


When raw point scores are transformed into PR's and the
resulting PR's are converted into equivalent "scores" in a normal
distribution, the final scores are said to be "normalized." If the
normal scaling distribution into which the scores are converted
has an M = 50 and σ = 10, the normalized scores are called
T-scores. Converting raw scores into T-scores can be easily
done with the aid of tables prepared for this purpose. First, the
PR's of the scores (or of the midpoints of the successive inter-
vals) are read from an ogive. The T-scores (normalized scores)
corresponding to these PR's are then read from tables. T-scores
range theoretically from 0 to 100, practically from about 15 to
85. The method of computing T-scores for a given distribution
will be found in detail in the references.

For several reasons, T-scaling is theoretically the soundest
method of converting raw scores into an equal-unit scale. Many
of the widely used educational achievement tests make use of
some variety of T-scaling. T-scores can be added or averaged;
they have the same meaning and denote the same relative
achievement.
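The table lookup described above can be imitated numerically. The sketch below is a stand-in for the printed tables, not the procedure given in the references: it assumes the conventional T-scale with a mean of 50 and a σ of 10 (the scale that matches the stated theoretical range of 0 to 100) and uses the inverse normal curve from Python's standard library to turn a PR into a normalized score.

```python
from statistics import NormalDist

T_MEAN = 50  # assumed mean of the T-scale
T_SD = 10    # assumed sigma of the T-scale


def t_score_from_pr(percentile_rank):
    """Convert a percentile rank (strictly between 0 and 100) to a T-score."""
    if not 0 < percentile_rank < 100:
        raise ValueError("percentile rank must lie strictly between 0 and 100")
    # Sigma-score that cuts off this proportion of a normal distribution.
    z = NormalDist().inv_cdf(percentile_rank / 100)
    return T_MEAN + T_SD * z


if __name__ == "__main__":
    for pr in (2, 16, 50, 84, 98):
        print(f"PR {pr:>2} -> T-score {t_score_from_pr(pr):.1f}")
```

Because PR's of exactly 0 and 100 have no finite normal equivalent, the function rejects them; this is one way to see why T-scores in practice stay well inside the theoretical limits.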


NORMS 


Norms are scores which are typical or characteristic of pupils 
of a given age or grade. To provide comparable norms, scores 
on group tests are expressed in PR's, standard scores, or nor- 
malized scores. Performance tests and individual intelligence 
scales have norms expressed in MA and IQ's. Many group
intelligence tests also have their raw scores put into MA and 
IQ terms. Such MA's and IQ's are rarely comparable to the 
MA's and IQ's of the Stanford-Binet. 

Educational achievement tests usually provide both age and
grade norms. From a table of norms, a teacher can tell whether
her class is up to grade level, and she can tell how individual
pupils in her class stand relative to each other on the sub-tests
of the battery. Suppose that Carl W., age 11-2, and just entering
the sixth grade, earns a score of 18 on an arithmetical problems
test of the Metropolitan Achievement Test. From the table of
norms we find that Carl has a PR of 68 on the test. Further-
more, we find that his age-equivalent is 12-4 and his grade
equivalent is 6.9. Carl's score is typical of children about a year
older than he, and his knowledge of arithmetic equals that of
children who are completing the sixth grade. His PR, of course,
reflects performance above the average.

The SRA (Science Research Associates) verbal and non- 
verbal tests are group tests of intelligence. Norms are given in 
PR's and IQ's. If a child achieves a score of 34 on the verbal 
section, for example, his PR from the table of norms is 40 and 
his IQ (really a standard score) is given as 96. The Stanford 
Achievement Test provides age and grade equivalents to obtained 
scores. Raw scores from nine sub-tests are converted into an 
equal-unit scale, in accordance with which a profile is drawn up 
(page 35). Suppose that Louise M., age 12 years and 6 months 
and in the last quarter of the seventh grade, earns a raw score of 
40 on the science test of the battery. From tables of norms we 
find that this score has a grade equivalent of 8.3 and an age 
equivalent of 13-4. Thus Louise’s score in science places her 
above her age and grade levels. Her PR on the science test is 60. 

Many aptitude tests supply scaled score norms for various 
groups of workers differing in experience, training, and skill. 
Interest inventories are scored so as to reflect an applicant’s in- 
terests in a large number of occupations. Thus if the vocational- 
interest blank is scored with the key for lawyer interests, we can 
tell whether the applicant has the interests of a lawyer and to 
what extent. Scores from personality questionnaires serve to iden- 
tify a subject as “dominant,” “introverted,” or “neurotic” in 
relation to the norms given for these classifications. 
Teachers use test norms for a number of purposes, which will
be elaborated upon in later chapters. Among the more important
objectives, we may list the following.

1. To estimate group achievement. Performance of the class as 
a whole can be evaluated against national, state, or local 
norms (page 115). 

2. To evaluate individual achievement. A pupil's score on an
educational achievement test is always considered in connec-
tion with his native capacity or mental alertness. A slow or
dull child may be working up to his limit, whereas a bright
child may be performing below expectation.

3. To evaluate family and cultural background. The achieve-
ment of a class or of an individual will always depend on his
socio-economic status, family background, and opportunities.

4. To evaluate the curriculum effects. A pupil's achievement
must be judged as good or poor in the light of the content,
emphasis, and objectives of the school.

5. To measure individual differences. There are always wide
differences in academic achievement within a group or class.
These differences are due in part to differences in native
ability and in part to differences in environmental oppor-
tunities.
QUESTIONS FOR DISCUSSION


1. A fifty-item multiple-choice test in science, administered to ninety
pupils, showed scores ranging from 16 to 48. Fifty scores fell between
35 and 48. What would the distribution be like? Would it be skewed?
What measure of central tendency would be most suitable?
2. In question 1, what would you conclude about the suitability of the 
test for the group? 
3. Explain the implications of each of the following correlation
coefficients:
(a) The correlation between height and score on an arithmetic test
is .04.
(b) Ratings of pupils for social adjustment and aggressiveness show a
correlation of −.65.
(c) The correlation between term grades and scores on a group
intelligence test is .70.
4. Rank the following 23 scores in order of size: 35, 40, 31, 29, 35, 23,
32, 34, 28, 34, 15, 14, 34, 40, 22, 32, 30, 39, 50, 19, 40, 27, 37. Compare the
"mid-score" with the mean.
5. Karl’s PR on a biology test is 48. What does this mean? 
6. Margaret has taken five tests. What would be the advantage of 
expressing her scores on these tests as PR’s? 
7. Given the following: 


          Paragraph Reading    Arithmetic
Mean            51.7              38.5
σ                9.2               6.5


William achieves a score of 56 on the first test and 35 on the second.
Convert these raw scores into z-scores.

8. How are age and grade norms obtained? Which is the more useful 
in determining placement? 

9. Two classes earn about the same mean on a test, but Class A's SD is
twice the size of Class B's. What do you conclude from this fact?

10. How would you validate a teacher-made test? 


CHAPTER 3


INDIVIDUAL INTELLIGENCE SCALES 


This chapter will consider four individual intelligence scales or
test batteries.* These are (1) the Stanford-Binet** (1937 or
revised form) designed for children from age 2 through adolescence;
(2) the Wechsler-Bellevue Intelligence Scale, for use primarily
with adults; (3) the Wechsler Intelligence Scale for Children
(WISC); and (4) the Arthur Performance Scale, useful from
about age 4 to maturity. These four scales, one for adults and
three for children, are representative of the best individual
intelligence scales now available. They are carefully constructed,
widely used, valid and dependable. Ordinarily, the individual
intelligence test will not be administered by the classroom
teacher. But the teacher must be familiar with the make-up of
these scales and with their role in the school program if he is to
make good use of the test findings.

* A test battery is a group of carefully selected tests designed to operate as a
team.

** The full name is Stanford Revision of the Binet-Simon Scale.

The individual intelligence examination should not be admin-
istered by a novice. To give such a test, and more important to
interpret it, requires special training in mental measurement
and in clinical psychology, plus a sound knowledge of psycho-
logical theory. In addition, at least six months should be spent in
giving and scoring these tests under supervision, if one is to have
a minimum of "clinical experience." Unfortunately, perhaps,
directions and materials for giving the individual scales are
readily available in the manuals, and the beginner is tempted to
try his hand at administering the tests. Much undeserved criticism
of the individual intelligence test, and of the MA and IQ, has
arisen from the faulty administration and interpretation of these
scales by the unskilled amateur.


The Concept of General Intelligence 


Before examining the individual intelligence scales in detail, 
we must get a clearer notion of what the tests are attempting to 
measure. This means that we must formulate a definition of what 
is meant by "general intelligence." 

Definitions of general intelligence have run the gamut from
such comprehensive biological descriptions as adjustment to the
environment to the fairly narrow designation of aptitude for
academic work. The French psychologist Alfred Binet defined
intelligence as (1) the ability to take and maintain a definite
direction (that is, to carry through a course of action once
begun); (2) adaptability to new situations and new requirements;
and (3) the power to evaluate and criticize one's own acts (not
present in the feebleminded). Other psychologists agreeing in
the main with Binet have stressed adjustment to life and capacity
to learn. In contrast with these broad formulations, Lewis M.
Terman, author of the Stanford-Binet, has defined intelligence
simply as the ability to carry on abstract thinking.

Definitions of general intelligence must of necessity be broad
when they stress biological adaptability to life. Such definitions
are hardly incorrect, but neither are they useful. Indeed, any
attempt to encompass such a comprehensive function as general
adaptability is a well-nigh impossible task. On the other hand, a
definition of intelligence simply as the ability to do school work
is certainly too narrow; we should include proficiency in every-
day activities in business and the professions, where aptitude
displayed in school finds ready application.

In order to give greater precision to the concept of intelligence,
the educational psychologist Edward L. Thorndike has suggested
that we recognize at least three broad areas of intelligent be-
havior. These "intelligences" he called abstract, mechanical, and
social. Abstract intelligence he defined as the "ability to under-
stand and manage ideas and symbols, such as words, numbers,
chemical or physical formulas, legal decisions, scientific prin-
ciples and the like. . . ." In the case of students, this is very close
to what is called scholastic aptitude. Mechanical intelligence in-
cludes "the ability to learn, to understand and manage things and
mechanisms, such as a knife, a gun, a mowing machine, an auto-
mobile, a boat, a lathe. . . ." Social intelligence is "the ability to
understand and manage men and women, boys and girls, to act
wisely in human relations." We should expect to find high ab-
stract intelligence in scholars, scientists, executives in business
and government; high mechanical intelligence in mechanics,
builders, expert carpenters and plumbers; and high social in-
telligence in politicians, salespeople, leaders in society. Presum-
ably the successful civil engineer possesses high abstract as well
as high mechanical intelligence; the successful criminal lawyer
abstract as well as social intelligence; the machinery salesman
mechanical and social intelligence. These "intelligences" are
positively, but not always highly, correlated. Hence, a high level
of one "intelligence" may accompany a fairly low degree of
another. A nuclear physicist (high in abstract intelligence) may
be socially inept. And the man successful in business or politics
may be mediocre in mechanical skills. Perhaps the able jack-of-
all-trades can be expected to rate well, but not necessarily very
high, in all three areas.

On examining the individual intelligence test, we find that it 
presents a variety of problems which demand the ability to utilize
ideas and symbols, for example, words, numbers, diagrams, 
pictures, geometrical figures. When used with young children, 
general intelligence tests are primarily measures of mental alert- 
ness on the abstract level. For adults, these tests are measures of 
the aptitude for such occupational and other tasks as draw upon 
abilities operative in school work. In short, the individual intel- 
ligence test measures abstract or scholastic ability primarily and 
is rarely a gauge of mechanical aptitude or of social competence. 
The evidence for this view comes from an analysis of the tests 
themselves, as well as from many studies in which individual 
intelligence tests have been used. 


THE STANFORD-BINET INTELLIGENCE 
SCALE (1937 REVISION) 


Because of the time required to administer the Stanford-Binet
(in most cases forty minutes to an hour) and the training de-
manded of the examiner, this test is rarely given routinely in
most schools. The classroom teacher must be generally familiar
with the Stanford-Binet, however, in order to know what can
be expected of it, that is, how it might add to her knowledge
of a given pupil. The Stanford-Binet is a valuable supplement to
a group intelligence test or to an educational achievement ex-
amination when (1) a child has a severe reading disability or
some physical handicap (for example, in sight, hearing, or mus-
cular co-ordination); (2) when a pupil exhibits marked emotional
stress or emotional disturbance; and (3) when other test results
or school marks do not jibe with the teacher's estimate of the
pupil's ability. For purposes of routine classification and place-
ment, the group intelligence test is about as satisfactory as the
Stanford-Binet, but the latter will provide a more accurate, de-
tailed, and comprehensive appraisal of intellectual level and is
more useful in diagnosis and prediction.


Description. The 1937 edition of the Stanford-Binet represents
a careful and thorough re-working of the earlier 1916 scale. The
number of test items was increased from 90 to 129 and the scale
extended down to lower age levels and much strengthened at
upper age levels. Two equivalent forms of the scale, called L and
M, were constructed. Table 3-1 contains a selection of the items
at different age levels. Note that at the lower age levels, such as
IV, the test situations make use of objects and pictures and require
that the child understand and carry out oral directions. At the
upper age levels, X, XIV, and Average Adult, the test items are
more abstract and bookish; the problems require verbal and
numerical manipulation, reasoning, logical selection and choice,
and good judgment. Memory for numbers and for sentences
recurs throughout the scale. Questions dealing with specific facts
learned in and out of school are excluded, but many common-
knowledge questions are included on the reasonable assumption
that what a person has learned in everyday living is a good index
of what he can learn (and will learn) later on. Some Stanford-
Binet test materials are shown in Figure 3-1 (facing page 54).


TABLE 3-1  Illustrative Tests from Stanford-Binet Scale
for Years IV, X, XIV and Average Adult

Year IV
1. Picture Vocabulary        Child must recognize and name everyday
                             objects seen in the pictures.
2. Naming Objects            Child is shown small toys representing com-
   from Memory               mon objects. These he names, or they are
                             named for him. Later he must recall from
                             memory the name of each object.
3. Picture Completion        Child must finish the incompleted drawing
                             of a man.
4. Pictorial Identifi-       Pictures of objects on cards to be identified.
   cation
5. Discrimination of         Recognition and identification of simple
   Forms                     geometrical forms.
6. Comprehension             Sensible answers to "why" questions.
Alternate: Memory            Repetition of short sentences read aloud to
   for Sentences             the child.

Year X
1. Vocabulary                The examinee must give definitions of
                             eleven words in a standard vocabulary list.
2. Picture Absurdities       Must recognize what is "foolish" in a pre-
                             sented picture.
3. Reading and               Reads a selection and reports from memory
   Report                    what is read.
4. Finding Reasons           Gives sensible reasons to explain cause-and-
                             effect relations in familiar situations.
5. Word Naming               Names as many words as he can in one
                             minute: a measure of word fluency.
6. Repeating Six             The lists are read aloud at the rate of about
   Digits                    one a second.

Year XIV
1. Vocabulary                Larger vocabulary required than at Year X.
2. Induction                 Tests ability to grasp and apply a general
                             rule.
3. Picture Absurdities       Must recognize what is "foolish" in a pic-
   III                       ture; more difficult than at Year X.
4. Ingenuity                 Tests ability to solve problems mentally.
5. Orientation: Direc-       Must be able to solve problems involving
   tion I                    space relations by following fairly com-
                             plex directions.
6. Abstract Words II         Must define words like "loyalty" and
                             "justice."

Average Adult
1. Vocabulary                Larger vocabulary than at Year XIV.
2. Codes                     Must learn two codes and write messages
                             in them.
3. Differences               Tests ability to generalize; makes use of
   Between                   fairly difficult concepts.
   Abstract Words
4. Arithmetical              Requires solution of mental arithmetic
   Reasoning                 problems.
5. Proverbs                  Interpretation of proverbs and fables.
6. Ingenuity                 Solution of problems requiring "mental
                             manipulation."
7. Memory for Sen-           Tests ability to reproduce rather long and
   tences                    involved sentences heard once.
8. Reconciliation of         Must tell how words denoting opposite
   Opposites                 states are alike. Tests ability to grasp ab-
                             stract relations.


Scope. The placement of test items at a given age level was
made to depend on the responses of one hundred children at each
age level below 6, two hundred children at ages 6 to 14 inclusive,
and one hundred children at ages 15 through 18. In all, about
3,000 children constituted the standardization group.

Terman and his co-workers selected children whose parents
constituted a good cross section of occupational levels in the
United States for the year 1930. The Stanford-Binet, like the
original Binet Scale, is an age-scale. (See page 32.) It begins at
2 years and items are grouped at one-half year intervals (at 2,
2½, 3, 3½, 4, 4½) up to 5 years. Mental growth at the lower
age levels is so rapid that the authors of the scale thought it wise
to narrow the gaps between age levels over this range. From 5
years to 14, test items are grouped by year intervals; and beyond
14 there is an average adult level and three superior adult levels.
The Stanford-Binet is most useful over the age range from about
6 to 14, that is, over the elementary grades.


Scoring. The Stanford-Binet assigns a mental age (MA) to a
child in accordance with his ability to progress up the age scale.
As shown in the examples below, two children may earn the
same MA on the Stanford-Binet in different ways.

James, who is 9-3 or 111 months old, earns an MA of 8-10, or
106 months, by scattering his answers up the scale from age
VII to age XIII. Robert also earns an MA of 106 months, but
does not scatter as much as James. MA is a measure of mental
maturity or status. Children differ in the way in which they
answer the test items, but by and large a child comes out with an
MA which indicates his ability to perform mental-manipulative
tasks like those of the Scale.


Test Record of James Brown, chronological age 9-3,
or 111 months*

              Tests Passed     Months Credit     Total Credit
Year Level    and Failed       Per Test          Years   Months
VII           all passed                           7
VIII          four passed      2 months                     8
IX            three passed     2 months                     6
X             two passed       2 months                     4
XI            one passed       2 months                     2
XII           one passed       2 months                     2
XIII          all failed                                    0
                                                   7       22

MA = 8-10 or 106 months
James' IQ = MA/CA × 100 = 106/111 × 100 = 95

* The expression 9-3 means 9 years and 3 months.


Test Record of Robert Green, chronological age 8-4,
or 100 months

              Tests Passed     Months Credit     Total Credit
Year Level    and Failed       Per Test          Years   Months
VII           all passed                           7
VIII          five passed      2 months                    10
IX            four passed      2 months                     8
X             two passed       2 months                     4
XI            all failed                                    0
                                                   7       22

MA = 8-10 or 106 months
Robert's IQ = 106/100 = 106 (decimal dropped)
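The bookkeeping in the two test records can be sketched in a few lines. The function names are illustrative; the two months of credit per test and the year level at which all tests are passed come from the records themselves, and the quotient is rounded to a whole number as in the printed IQ's.

```python
MONTHS_CREDIT_PER_TEST = 2  # credit per test passed, as in both test records


def mental_age_months(all_passed_year, passes_above):
    """MA in months: the year level passed in full, plus credit for later passes.

    passes_above lists the number of tests passed at each higher year level.
    """
    return all_passed_year * 12 + MONTHS_CREDIT_PER_TEST * sum(passes_above)


def ratio_iq(ma_months, ca_months):
    """IQ = MA / CA x 100, rounded to the nearest whole number."""
    return round(100 * ma_months / ca_months)


def years_months(months):
    """Format a number of months in the book's years-months notation."""
    years, rest = divmod(months, 12)
    return f"{years}-{rest}"


# James: all passed at VII; 4, 3, 2, 1, 1, 0 passed at VIII through XIII.
james_ma = mental_age_months(7, [4, 3, 2, 1, 1, 0])
# Robert: all passed at VII; 5, 4, 2, 0 passed at VIII through XI.
robert_ma = mental_age_months(7, [5, 4, 2, 0])

if __name__ == "__main__":
    print("James :", years_months(james_ma), "IQ", ratio_iq(james_ma, 111))
    print("Robert:", years_months(robert_ma), "IQ", ratio_iq(robert_ma, 100))
```

Both boys accumulate the same 22 months of credit beyond Year VII, which is why different patterns of passes lead to the same MA.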

The intelligence quotient, or IQ, is found by dividing the
child's MA by his CA (chronological age) and is a measure of
brightness or dullness. James has an IQ of 106/111, or 95, and
Robert, who is 11 months younger, has an IQ of 106/100, or 106.
Both boys have the same mental maturity, but Robert is brighter
than James because he has reached the maturity level of 8-10
at an earlier age. The two measures, MA and IQ, are comple-
mentary, each providing distinctive information. A child of 8
and a man of 40 may each earn an MA of 8 years on the Stanford-
Binet (be of the same mental status in terms of the tests). But
the child has an IQ of 100 (8/8) and is normal, whereas the man
is feebleminded, with an IQ of approximately 53. (Read from the
tables in the Manual.*)

The IQ is a developmental ratio which inevitably loses its
value as a child grows older and mental maturity is approached.
There is little difference in mean performance on the Stanford-
Binet at ages 15, 18, and 20, and a correction table is provided
in the Manual which adjusts the CA divisors in order to make the
older person's IQ comparable to that of the child. There is no
specific age at which intelligence can be said to "mature" or
reach its peak, but 15 is taken somewhat arbitrarily to be the MA
of the average adult on Stanford-Binet. For any person over 16,
therefore, the corrected divisor is 15 years. The highest MA
which can be earned on the Stanford-Binet by passing all of the
tests in the Scale is 22¾ years. This MA yields a maximum IQ
for adults of 152, found by dividing 273 months by 180 months
(that is, 15 years).
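The effect of the corrected divisor can be illustrated with a small sketch. It covers only what the text states outright: for anyone over 16 the divisor is 15 years. The Manual's correction table for the ages just below that point is not reproduced here, so the sketch simply uses the actual CA for those ages, which is an assumption of the example and not the Manual's rule.

```python
ADULT_DIVISOR_MONTHS = 15 * 12    # corrected CA divisor for persons over 16
ADULT_CUTOFF_MONTHS = 16 * 12


def stanford_binet_iq(ma_months, ca_months):
    """Ratio IQ with the adult divisor applied to anyone over 16.

    Assumption: below the cutoff the actual CA is used as the divisor;
    the Manual's graded correction table is not modeled.
    """
    if ca_months > ADULT_CUTOFF_MONTHS:
        divisor = ADULT_DIVISOR_MONTHS
    else:
        divisor = ca_months
    return round(100 * ma_months / divisor)


if __name__ == "__main__":
    print("Child of 8, MA 8 years:", stanford_binet_iq(8 * 12, 8 * 12))
    print("Man of 40, MA 8 years :", stanford_binet_iq(8 * 12, 40 * 12))
    print("Adult with top MA of 273 months:", stanford_binet_iq(273, 40 * 12))
```

Without the fixed divisor the man of 40 would be divided by 480 months and his quotient would keep falling with every birthday, which is the loss of meaning the paragraph describes.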


STANFORD-BINET IN THE SCHOOLS 


The evaluation of pupils from their school grades or from sub-
jective impressions of cleverness or brightness is often quite
misleading. A teacher may describe a conscientious, amiable girl
of ten who is one year overage for grade as "bright" when her
IQ turns out to be relatively modest. Contrariwise, a rude, in-
attentive youngster may be rated as "about average" or even
below average when his IQ is in reality considerably above
normal. Judgments of intelligence are always influenced by
personality traits and social behaviors. It is not strange, there-
fore, to find that two pupils must in general differ by as much
as twenty points of IQ before a teacher is forced to lay other
criteria aside and admit that the badly behaved youngster is
brighter than the courteous, hardworking one.

* See Terman, L. M., and Merrill, M. A. Measuring Intelligence. New York:
Houghton, Mifflin Co., 1937. Tables, pp. 45-450.

Teachers should know certain facts about the IQ, what it is 
and how best it can be used, in order to make maximum use of 
the information provided by the test. More specifically, the 
classroom teacher should know (1) the range of IQ's to be ex- 
pected in the school population, (2) the dependence to be placed 
on the IQ as a measure of intelligence, (3) to what extent the 
test has diagnostic value, and (4) the limitations of the IQ and 
the precautions to be observed in making interpretations based 
on it. These topics will be considered in the following sections. 


Range of IQ's in the School Population 


FIGURE 3-2 Distribution of IQ's on the Stanford-Binet Scale
for Nearly 3,000 Children, 2-18 Years Old
From Terman, Lewis M., and Merrill, Maud A., Measuring …

The frequency polygon in Figure 3-2 shows the distribution
or spread of IQ's for the nearly 3,000 children from 2 to 18
years old who made up the standardization sample. The fre-
quency polygon is close to the normal curve model (page 17).
IQ's center at 100 and range about equally above and below this
value. The σ of the IQ distribution is about sixteen points
(exactly 16.4). This means that the middle 2/3 of school chil-
dren will earn IQ's between 84 and 116. About 1/6 of the
children will have IQ's above 116 and 1/6 will have IQ's below
84. See Figure 2-7. The percentage of school children who can
be expected to occupy the different IQ levels may be summar-
ized as follows:

TABLE 3-2*

Numbers of Children in the School Population to Be
Expected at Various IQ Levels

                                            Percent of Children
IQ Level          Description               in Each Category
130 and above     Superior or gifted               …
110 – 129         Above average to high            …
90 – 109          Average or normal             45 – 50
70 – 89           Low normal to dull            20 – 25
Below 70          Dull to feebleminded           2 – 3


The number of children found in any group (especially in the
two extreme groups) will vary somewhat with the social and
economic conditions of the community and with the standards
set up for defining the different intelligence levels.
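The statement that the middle two-thirds of IQ's fall between 84 and 116 follows from the normal curve model. The sketch below is a rough check that assumes an exactly normal distribution with a mean of 100 and a σ of 16.4; the standardization sample is only close to that model, so the shares it prints are approximations.

```python
from statistics import NormalDist

# Assumed model: exactly normal, mean 100, sigma 16.4 (the values quoted above).
IQ_MODEL = NormalDist(mu=100, sigma=16.4)


def share_between(low, high):
    """Proportion of the model distribution lying between two IQ values."""
    return IQ_MODEL.cdf(high) - IQ_MODEL.cdf(low)


def share_above(iq):
    """Proportion of the model distribution lying above an IQ value."""
    return 1 - IQ_MODEL.cdf(iq)


if __name__ == "__main__":
    print(f"Between 84 and 116: {share_between(84, 116):.1%}")
    print(f"Above 116:          {share_above(116):.1%}")
    print(f"Below 84:           {IQ_MODEL.cdf(84):.1%}")
    print(f"Between 90 and 110: {share_between(90, 110):.1%}")
```

In a real community the shares will drift from these figures for the reasons just given, most noticeably in the two extreme groups.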

The IQ is useful in setting educational expectations. Suppose
that William Butler, a fifth-grade pupil in a large school system,
has a chronological age of 10-2 and a Stanford-Binet IQ of 116.
William reads at fifth-grade level, is somewhat above average in
his other subjects, and is excellent in arithmetic. He is a quiet,
well-behaved boy who seldom becomes angry or annoyed.
William makes friends readily and is accepted as a member of
his group. What are William's educational expectations?

Table 3-3 will be of help in answering this question. William
falls in the upper 16 per cent of school children. He should have
no trouble completing elementary and high school. If he is in-
dustrious, emotionally stable, and has intellectual interests,
William may be encouraged to go to college. It might be wise
to advise a college of not too high standards if William is lacking
in self-confidence, and to enlist his parents' enthusiastic support.

FIGURE 3-1 Test Materials Used in the Stanford-Binet Scale
Reproduced through the courtesy of C. H. Stoelting and McGraw-Hill.

FIGURE 3-4 Items from the Performance Part of the Wechsler
Adult Intelligence Scale

Mary, age 12-0 with an IQ of 83, presents a very different
picture from that of William. Mary is doing barely passing work
in the fifth grade, though she is about two years overage for the
grade. Since her MA is not more than 10 years,* she is perhaps
doing all we can reasonably expect of her. It would be manifestly
unfair to scold Mary and insist that she "try harder." Mary's
educational expectation (see Table 3-3) is no higher than the
eighth grade, if that.

* Since MA/CA = IQ, Mary's MA is 12 × .83, or about 10 years.


TABLE 3-3
Educational Expectation in Relation to IQ Level

IQ Level
(Stanford-Binet)    Educational Expectation

120 +               Can do acceptable work in a first-class college if properly
                    motivated.

115 – 119           Should do acceptable but not outstanding college work.
                    Would probably do best in a small college where the
                    work is individual and standards not too high.

105 – 114           Should complete high school, and may do well in the less
                    difficult college courses. Will have trouble with science
                    and mathematics.

90 – 104            This group constitutes about 50% of the elementary
                    school population. If not retarded by illness or other
                    causes should complete the eighth grade on schedule.
                    Some of these pupils will do fairly well in high school.

80 – 89             Usually one to two years over age for grade. Acceptable
                    high school work very unlikely for IQ's below 90. A
                    child of IQ 80 will complete the eighth grade (if at all)
                    two to three years behind schedule.

75 – 79             These children may reach the fifth grade. Will rarely
                    go beyond unless given much individual attention.

Below 75            If one of these children reaches the fifth grade he will be
                    14-15 years old. Unable to do fifth-grade work; but be-
                    cause of chronological age is likely to be pushed ahead
                    after repeating each grade two or three times. May be
                    promoted because of age far beyond his mental capacity.
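The footnote's arithmetic for Mary runs the IQ ratio backwards, from IQ and CA to MA. A minimal sketch of that step follows; the function names are illustrative, and the years-months formatting rounds to the nearest whole month.

```python
def mental_age_years(iq, ca_years):
    """Solve IQ = 100 * MA / CA for MA, with both ages in years."""
    return ca_years * iq / 100


def years_months(age_years):
    """Express an age in years as the book's years-months notation."""
    total_months = round(age_years * 12)
    years, months = divmod(total_months, 12)
    return f"{years}-{months}"


# Mary: CA 12-0, IQ 83.
mary_ma = mental_age_years(83, 12.0)

if __name__ == "__main__":
    print(f"Mary's MA is about {mary_ma:.2f} years ({years_months(mary_ma)})")
```

Comparing a mental age found this way with the usual age for a grade is one simple way to judge whether a pupil's work is already in line with reasonable expectation.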


Stability of the IQ


When a second form of the Stanford-Binet is administered to
a child, this second IQ will often vary somewhat up or down
from the first determination. Norman's IQ, for example, may
be 109 today, whereas it was 112 six months ago, and may very
well be back to 106 six months hence. The stability of a test
score when the test is repeated or another form given, is called
the reliability of the test (page 28). Stanford-Binet is one of
our most dependable mental examinations, with reliability co-
efficients which are usually well over .90 (page 29). Despite
this fact, fluctuations in individual IQ's can still be expected
when the test is repeated or a second form administered.

The reliability of a test is conveniently expressed by the
standard error (SE) of a test score (page 30). The SE gives the
allowable (one might almost say the inevitable) changes to be
expected when a second form of the test is given. The SE of the
Stanford-Binet IQ is four to five points* for IQ's between 90 and 110.
The SE is slightly higher for high IQ’s and somewhat lower for
low IQ's. Expressed in terms of chances or probability of change,
a SE of five points means that the odds are roughly 2:1 that an
IQ of 102, for example, will not be higher than 107 (102 + 5)
nor lower than 97 (102 − 5) on retest. The SE represents the
amount of fluctuation to be expected in most cases. The change
in a few individual cases may be somewhat greater than five
points or somewhat less than five points. Fluctuations in IQ from
time to time arise from many causes: changes in the testing
situation and changes in the child being tested. When a child’s
mental or physical health or his home or school environment
change radically between tests, fluctuations in IQ can be
expected. Mental measurement is never as precise as physical
measurement: a child is a much more variable “object” than, say, a
piece of metal. Changes in IQ from one test to another rarely
shift a child from one classification to another, however (see
Table 3-3)—that is, from normal to superior or from dull to
normal. The consensus, in fact, is that the IQ is extremely hard to
change and that we can accept an IQ when expertly determined
as a reliable appraisal of a child’s general mental level.


* When SE = 16 √(1 − .90), where 16 is the SD of Stanford-Binet IQ's,
we have 5 as the approximate value of the SE. This is a slight
overestimation, as the reliability coefficient is usually above .90.
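
The standard-error arithmetic in this section can be checked with a short Python sketch. It assumes the usual formula SE = SD × √(1 − reliability), with SD = 16 and a reliability of .90 as in the footnote; the function names are illustrative only and do not come from any test manual.

```python
import math

def standard_error(sd: float, reliability: float) -> float:
    """Standard error of a score: SD times the square root of (1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def retest_band(iq: float, se: float) -> tuple:
    """Limits one SE below and above the obtained IQ (odds of roughly 2:1)."""
    return (iq - se, iq + se)

se = standard_error(sd=16, reliability=0.90)
print(round(se, 2))          # close to 5 points
print(retest_band(102, 5))   # the 97-to-107 band used in the text
```

A higher reliability coefficient (say .93) gives a smaller SE under the same formula, which is why the text calls 5 points a slight overestimate.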


The Stanford-Binet IQ in Diagnosis of Child Behavior 


Children who achieve the same mental age will differ in IQ
when their CA's differ (page 51). Furthermore, even when the
IQ is the same, two children may differ sharply in various 
aspects of mental development, as shown by the sorts of tests 
passed and failed and by the degree of scatter over the scale. 
The Stanford-Binet is primarily a standard test-interview de- 
signed to furnish a cross-sectional view of a child's intellectual 
capacities—that is, to give the level at which the child normally 
functions. At the same time, the school psychologist, in writing 
an account of a child's performance on the test, will usually 
note irregularities in development and learning ability, and these 
observations provide the teacher with valuable clues to an under- 
standing of the child. Visual handicaps, inco-ordinations, and 
other physical handicaps may be noted; so also may be noted 
deficiencies in arithmetic skills, in word comprehension, in rea- 
soning, and in current information. The sub-tests of the Stanford- 
Binet call for fairly specific performances, and are not sufficiently 
numerous or comprehensive to permit the final judgment that 
“John is weak in number work, but excellent in rote memory,”
or that “Sarah’s verbal facility far exceeds her manipulative
skills.” But the pattern of a child’s responses and the relative
strengths and weaknesses displayed on groups of items will
provide useful information.

Parents are often puzzled when a child who is a discipline
problem is, at the same time, described as above normal in
intelligence. The reason, of course, is that Stanford-Binet is not a
measure of social intelligence or of emotional stability but of
general verbal or abstract level (page 46). At the same time, the
observant psychologist will note and record characteristic
emotional and temperamental behavior displayed as the child takes
the test. The rude or indifferent youngster, who doesn't care and
who doesn’t co-operate; the spoiled and petulant “brat,” who
gives up and pouts at the first failure; the timid and insecure
child, who inquires eagerly “Is that right?” after answering each
item—all these children reveal their distinctive personality traits
by the manner in which they tackle the test. Standards of
behavior in the home, ideals of conduct, values, and attitudes are
often exhibited clearly, if indirectly, in the course of a mental
examination. At Year VII, for example, is the question "What's 
the thing to do if another boy (or girl, depending on the sex of 
the examinee) hits you without meaning to do it?” The child 
who is immature socially or reared in a rough-and-tumble com- 
munity will answer promptly “Hit him back.” The 7-year-olds 
who are better trained in acceptable social practices will qualify 
their replies, or suggest that forgiveness may be in order if the 
blow were truly accidental. 

The following case histories will illustrate how qualitative 
analysis of a child’s test performance can help the classroom 
teacher who refers him to the psychologist.

Case I. SM, a boy; CA = 10-2, MA = 8-2, IQ = 80.

This boy was referred by his teacher because of unsatisfactory
work in the fourth grade. He is a good-looking, polite lad,
normal in appearance and in social manner. Anyone unacquainted
with his school work might judge SM to be average in
intelligence, or perhaps above average. On the Stanford-Binet, SM’s
vocabulary was childish, with definitions in terms of use. He
passed the vocabulary test only at Year VI. His answers to the
picture and verbal absurdities were halting, poorly phrased, and
uncomprehending. He gave inaccurate and meager responses to
“seeing relations” items—differences and similarities. He was
poor in number relations; his co-ordination and rote memory
were fair. SM is a dull boy who may reach the eighth grade, but
is not likely to go beyond. It was recommended that SM
undertake vocational training.


Case II. RW, a girl; CA = 11-2, MA = 11-3, IQ = 100.
RW is a well-developed girl, apparently calm and
self-possessed. She was referred by her teacher because of poor work in
the sixth grade; she is described as being inattentive and given
to daydreaming. RW seemed indifferent to the test, but did not
refuse to co-operate. She often asked that questions be repeated,
and the examiner suspects slight deafness. She became more
interested as the test proceeded, especially when she got the answers
to several questions. Her vocabulary is at Year X, but her verbal
ability is about normal, as shown by her ability to deal with
pictures and verbal absurdities, name words, define abstract terms,
and deal with similarities. Her attention was somewhat variable
and she was easily distracted. She showed uncertainty in using
number relations, as, for example, in making change. RW is
normal in intelligence and should be able to do satisfactory work
in the sixth grade. It is suspected that her daydreaming is, in
part, a consequence of puberty. It was recommended that the
classroom teacher check on RW's friends, outside activities, and
home conditions.


Case III. HP, a boy; CA = 6-5, MA = 9-6, IQ = 148.

The second-grade teacher is not sure what to do with HP; he
seems to know everything she is teaching. HP’s father is a
prominent surgeon. This boy entered school at 6-1 and was put
in the second grade. He is well mannered, normal in play and in
social activities, and gets along well with his classmates. HP
whizzed through the tests for Years VI, VII, and VIII. His
vocabulary is at Year X. He defined an orange as “a citrus fruit,
round and yellow, comes from Florida." His co-ordination is not
up to his verbal level, but his memory and perception of
differences and likenesses are excellent. HP is a very bright youngster.
He should be ready for high school by age 12 or earlier. He
should now be in the fourth grade, if he is ready for it socially.
If promotion is not feasible, a program of outside reading and
some special attention is suggested.
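
The ages in the three case histories are written as years-months, and each IQ is the ratio MA/CA multiplied by 100. The Python sketch below redoes that arithmetic as a rough check. It reads Case I's CA as 10-2 (10 years, 2 months); the helper names are illustrative, and the exact rounding used in the published Stanford-Binet tables is not assumed.

```python
def to_months(age: str) -> int:
    """Convert an age written as 'years-months' (for example '10-2') to months."""
    years, months = age.split("-")
    return int(years) * 12 + int(months)

def ratio_iq(ma: str, ca: str) -> float:
    """Ratio IQ: 100 * MA / CA, with both ages first converted to months."""
    return 100.0 * to_months(ma) / to_months(ca)

# (MA, CA) for the three cases described above.
cases = {"SM": ("8-2", "10-2"), "RW": ("11-3", "11-2"), "HP": ("9-6", "6-5")}

for name, (ma, ca) in cases.items():
    print(name, round(ratio_iq(ma, ca), 1))
```

The unrounded ratios fall within a point of the reported IQ's of 80, 100, and 148.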


Precautions to Be Taken in Interpreting the IQ 


Some of the factors which may influence a child’s IQ have
been touched upon in preceding paragraphs. To what extent the
IQ is an index of “innate ability” will depend upon the
co-operation and motivation of the examinee, and upon how expertly the
test has been administered. Several conditions which may affect
the reliability of an IQ are the following:


Physical causes: Sensory defects, deafness, poor eyesight. Malnutrition 
and illness are also important.

Examiner: The personal equation of the examiner may be crucial. Mental 
test examiners who are poorly trained, have harsh and unpleasant 
voices, peculiarities of manner or dress, or who are supercilious or 
arrogant in their relations with the child get poor co-operation and 
uncertain test results. 

Testing conditions: Test results are likely to be unreliable when the 
examination room is bare, too cold or too warm, or overdecorated. 
Coaching on the tests must always be watched for, since the tests have 
been widely distributed. 

Environmental surroundings: The degree of stimulation received in the 
home, the school and the community will markedly affect the test 
performance. Children from homes broken by divorce or by drunken- 
ness will often show IQ increases of as much as twenty points after 
several months of kind treatment. On the other hand, children from
good homes who have been transferred to a deprived and restrictive 
environment (as, for example, in war) may show sharp drops in IQ. 


Because of the many factors which may affect its
determination, a Stanford-Binet IQ should not immediately be denounced
as worthless should there be a considerable shift in a second
rating. Instead, a drastic change in IQ should be taken as a
challenge, and the causes ferreted out if possible. The neglected
dull normal child when taken into a good home will often show
an increase in measured IQ, as will a child adopted into a good
family. By the same token, a normal child will do poor school
work if he is insecure and unhappy. It seems very unlikely that
even sharp changes in IQ reflect a real alteration in a child's
aptitude. At least all of the environmental factors should be
considered before this conclusion is reached.


Constancy of the IQ over the Age Range 


Suppose that Bob White, who is 7 years old, has an MA of 8
years on Stanford-Binet and an IQ of 8/7, or 114. When Bob
is 14 years old, his MA must be 16 years if his IQ is to remain
constant at 114 (16/14 = 114). The IQ is a measure of
brightness or dullness relative to a child's age group. Hence, should an
IQ fluctuate widely—as, for example, from 114 to 85 or to 140—
the ratio MA/CA becomes valueless. We have said earlier (page
56) that when the IQ of a child has been determined by an
expert, it is a highly dependable index. But whether the IQ
remains constant over the years from 6 to 14 (over the
elementary school, for example) will depend, for one thing, on the way
in which the test has been constructed. This question is
appropriate, therefore: “Is the Stanford-Binet so constructed as to
make a constant IQ probable or even possible?"

There are three conditions which an intelligence test must
meet if the IQ, defined as the ratio MA/CA,* is to remain
constant over the age-scale. These are:

1. Increased spread of MA's (larger SD's) as we go up the
age-scale.

2. Homogeneity of mental function over the age range
covered by the scale. Homogeneity means that the test measures the
same "intelligence," for example, from age 2 to age 18.

3. Zero correlation between chronological age and IQ.

These conditions are met to a high—though not perfect—
degree by the Stanford-Binet. They are not met, even
approximately, by most group intelligence tests (page 97). Let us
examine each condition further.

1. The SD of the Stanford-Binet MA distributions increases
fairly regularly with chronological age. At Year VII, for
example, the SD of the mental age distribution is 1 year; at Year X
the SD is 1.6 years; at Year XIII it is 2.3 years; and at Year XVI
it is 2.6 years. This means that if Bob White has a CA of 7 years
and is one SD above the mean for his age (that is, at 8 years),
his IQ will be 8/7, or 114. If Bob maintains his rate of mental
growth, at age 10 his MA will be 1 SD above the mean, or at
11.6 years (10 + 1.6). Bob's IQ is now 11.6/10 or 116. At Year
13, should Bob stay 1 SD above the mean, his MA will be 15.3
(13 + 2.3) and his IQ 117. And at age 16, should Bob maintain
his rate of growth, his MA should be 18.6 (16 + 2.6) and his IQ
18.6/16 or 116. Figure 3-3 shows that when a child maintains
an accelerated rate of growth, his IQ (like Bob's) will remain
approximately constant—that is, within 2 to 3 points.


* The IQ may also be defined as a standard score (p. 39). The conditions for
IQ constancy, given above, apply only to age-scales.


FIGURE 3-3 Age-Progress Curves for the Stanford-Binet Scale

[Note that the spread of MA’s becomes greater with increasing
chronological age.]


Figure 3-3 shows that when a child is below the mean for his
age, his IQ will again remain approximately constant should he
maintain his slower rate of growth. If a child has an MA of 6 and a
CA of 7—is 1 SD below the mean for his age—his IQ will be
6/7, or about 86. Should this child maintain his slower rate of
growth, at Year XIV his MA will be 11.8 (14 − 2.2) and his IQ
will be 11.8/14, or 84. It is the increasing spread of MA's with
increasing CA which keeps the ratio MA/CA constant within
2 to 3 points, always provided the child maintains a constant rate
of growth. (See Figure 3-3.)
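
The effect of the widening spread of MA's can be illustrated with a few lines of Python. The sketch uses the four SD values quoted above and takes the mean MA at each age to equal the CA. The comparison with a fixed SD of 1 year is an added illustration, not a figure from the text, and the names are hypothetical.

```python
# SDs of the MA distribution (in years) at four ages, as quoted in the text.
SD_BY_AGE = {7: 1.0, 10: 1.6, 13: 2.3, 16: 2.6}

def ratio_iq(ca: float, z: float, sd: float) -> float:
    """Ratio IQ of a child whose MA lies z SDs from the mean MA (taken as the CA)."""
    ma = ca + z * sd
    return 100.0 * ma / ca

# Bob stays 1 SD above the mean while the SD grows with age.
growing = [ratio_iq(ca, 1.0, sd) for ca, sd in SD_BY_AGE.items()]

# Hypothetical contrast: the same child if the SD stayed at 1 year throughout.
fixed = [ratio_iq(ca, 1.0, 1.0) for ca in SD_BY_AGE]

print([round(iq, 1) for iq in growing])
print([round(iq, 1) for iq in fixed])
```

With the growing SD the ratio stays within a few points of its starting value; with a fixed SD it drifts steadily toward 100, which is the sense of condition 1.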

2. Statistical analysis has shown that the correlation between
successive MA levels is very high, and that the Stanford-Binet
is measuring essentially the same “intelligence” as we go up the
age-scale.

3. When a child reaches the upper teens, mental growth
changes as shown by the Stanford-Binet fail to keep pace with
chronological age. When this happens, the curves in Figure 3-3
lose altitude and bend over to become parallel with the baseline.
Failure of the MA to increase with CA leads inevitably to a
falling IQ among older children and, if uncorrected, to a negative
correlation between CA and IQ. (Negative correlation follows
because the CA continues to increase, whereas the IQ no longer
does—see page 52.) To overcome this fault in the age-scale,
the authors of Stanford-Binet provide a steadily decreasing CA
divisor from age 13 and above. This procedure bolsters up the
IQ by lessening the denominator (CA) and thus balancing the
decreasing numerator (MA). This means that a child's IQ does
not have to decrease as the child grows older—and that there is
no systematic correlation (positive or negative) between CA
and IQ.


THE WECHSLER-BELLEVUE 
INTELLIGENCE SCALE* 


Description. The Stanford-Binet is sometimes used to measure
the intelligence of adolescents and young adults, but it is not well
suited to these groups, since the items of the test were selected to
appeal primarily to children. A better examination for measuring
adult intelligence is the Wechsler-Bellevue Intelligence Scale, an
individual intelligence test designed especially for adults. The
Wechsler-Bellevue is, on the whole, a well-made examination.
The group used in standardizing the test battery—that is, the
group upon whose answers the scoring and norms depend—
consisted of about 1700 persons chosen from a larger group of
3500. The sample was chosen to represent the occupational
distribution of the adult population at the time of the 1930 census.
The sample is adequate in size, but the fact that it was drawn
mostly from New York City and New York State renders
questionable its claim to represent the country as a whole.


* The Wechsler Adult Intelligence Scale (WAIS), published in 1955,
represents a revision of the Wechsler-Bellevue Intelligence Scale
(W-BIS). WAIS makes use of the same principles of construction, scoring and
IQ derivation found in the older scale, and the two are essentially the same test.
W-BIS is described here rather than WAIS because it is better known and is
still widely used.

The Wechsler-Bellevue Scale consists of two parts, a Verbal 
Scale and a Performance Scale. Language is required in the first 
scale, but the tests in the second part demand no language in the 
actual solution of the problems. Directions, however, are given 
orally. What is called the Full Scale is a combination of the 
Verbal and Performance sections. The Verbal Scale is made up 
of five tests, as follows: 


VERBAL SCALE 


1. General Information: Twenty-five questions covering a wide range of
common information and dealing with facts which all normal adults 
have presumably had a chance to learn. Questions are graded in 
difficulty from easy to hard. 

2. General Comprehension: Ten questions and two alternates, in each 
of which the examinee is asked to tell what should be done in certain 
situations, or why certain practices should be followed. The questions 
are planned to measure practical judgment, common sense, and 
understanding. 

3. Arithmetic Reasoning: Ten mental arithmetic problems. Each problem 
is presented orally and must be solved without the use of paper or 
pencil (“in the head"). 

4. Digits Forward and Backward: Memory span for digits presented one
at a time and ranging in number from 3 to 9. In the second part of
the test, the examinee must give the list of numbers in reverse order.

5. Similarities: Twelve word-pairs, each pair alike in some way. The
examinee must say in what way the two words are alike.

6. Vocabulary (Alternate): A list of forty-two words graded in
difficulty to be defined orally.


There are five tests in the Performance Scale, as follows:


PERFORMANCE SCALE


1. Picture Completion: Fifteen cards, each containing a picture from
which some part is missing. The examinee must give the missing part.

2. Picture Arrangement: Six sets of pictures, each set containing from
three to six separate pictures. The examinee is to arrange the pictures
in any given set so that they tell a story.

3. Object Assembly: Three form-boards—the Manikin, the Profile, and
the Hand. The parts of each form-board must be put together, much
as in a jigsaw puzzle, to form a complete object.

4. Block Design: Sixteen small cubes (blocks) colored red, white, and
red-and-white on the sides. The blocks are to be arranged to match
seven designs presented on test cards. The designs require from four
to sixteen cubes.

5. Digit Symbol: A well-known association test. Nine numbers are
matched with nine symbols in accordance with a key.

Samples of the items from the performance part of the
Wechsler Adult Intelligence Scale are shown in Figure 3-4
(facing page 54). These tests are “performance” in the sense
that the examinee in solving the problem must make use of
diagrams, pictures, form boards, and cubes. But “ideas”—that
is, symbols—are certainly not excluded. Wechsler’s performance
tests, therefore, are measuring abstract rather than motor or
mechanical intelligence.

Scope. The Wechsler-Bellevue Scale provides scores in the
form of “IQ’s.” Norms run as low as 10 years, but the scale’s
principal application is over the age range from about 20 to 60
years. Beyond 60 years, Wechsler-Bellevue IQ's are not always
dependable, owing in part to the small samples at advanced age
levels. But these IQ's may be taken as useful estimates of general
intelligence. Age-level scores on the Full Scale (Verbal +
Performance) show a gradual decline after 20, the drop in score
from age 20 to age 60 being about 20 per cent.

Scoring. Following the directions given in the scoring guide
(Manual), the examiner first adds up the items done correctly
(speed is sometimes a factor) for each of the ten sub-tests. Scores
on each sub-test are then converted into standard scores (page
38), in which the mean for the 20-34 age group is set at 10
and the SD at 3. Conversion of the separate sub-test scores
into a common standard score scale allows the examiner to
combine the tests into a single index and thus to compare the ups and
downs in performance from sub-test to sub-test.

The Wechsler-Bellevue does not provide mental ages, since 
the concept of mental age, though useful with children, has little 
meaning when applied to normal adults. The Scale does provide 
for an IQ (called a “deviation IQ”), which is essentially a
standard score. There are three IQ's, one from the Verbal, one from
the Performance, and one from the Full (combined) Scale. In 
each case these deviation IQ’s are found in the following way: 
Scores on the sub-tests (10 for the Full Scale) are added and the 
total is converted again into a standard score, this time with a 
Mean = 100 and a SD = 15 (page 39). At each age level (for 
example, at 30, 40, 50) the mean score got from the sub-tests is 
set at IQ 100. A score which is one SD above the mean at any 
age level then becomes an IQ of 115. Putting the IQ for each age 
level at 100 adjusts for the steady fall in total test score with age. 
Standard score IQ's or "deviation IQ's" below 100 denote the 
same degree of retardation with reference to one's age group. 
For example, we read from the Manual that a man aged 35 who 
achieves a score of 75 on the 10 tests of the Full Scale has an IQ 
of 92—is slightly below the mean of his age group. The same
score of 75 becomes an IQ of 96 at age 45 and an IQ of 100 at 
age 60. This means that a total score of 75 on the 10 sub-tests is 
"normal" (or "at age") for age 60 and hence receives an IQ of 
100. But the score of 75 is below the mean for the younger
groups. Again, the examinee who earns a score of 90 (Full Scale)
has an IQ of 109 if he is 57, an IQ of 102 if he is 37, and an IQ
of 94 if he is 22 years old.

To summarize, Wechsler-Bellevue Scale IQ’s are converted,
or standard, scores in which the mean is always 100 for each age
group and the SD is 15. Wechsler-Bellevue IQ’s have the same
meaning from one age to another in the sense that an IQ of 105
or of 86 implies the same relative superiority or inferiority to the
examinee’s age group. The Wechsler-Bellevue IQ is a standard
score, whereas the Stanford-Binet IQ is a ratio, MA/CA. The
two indices are highly correlated, but are not equivalent. To
avoid confusion, it helps to write “Wechsler-Bellevue IQ”
whenever the deviation IQ is meant. Both the Wechsler-Bellevue IQ
and the Stanford-Binet IQ are measures of abstract intelligence
(page 46).
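
Both conversions described in this section (sub-test scores to a scale with mean 10 and SD 3, and summed scores to a deviation IQ with mean 100 and SD 15) are linear standard-score transformations. The Python sketch below shows the arithmetic only. The function names and every norm value in it are hypothetical, chosen for illustration; the actual conversions are read from tables in the Manual.

```python
def standard_score(raw, group_mean, group_sd, new_mean, new_sd):
    """Re-express a raw score on a scale with the chosen mean and SD."""
    z = (raw - group_mean) / group_sd  # distance from the group mean in SD units
    return new_mean + new_sd * z

# Hypothetical sub-test norms for the reference group (not from the Manual).
scaled = standard_score(raw=14, group_mean=11, group_sd=2, new_mean=10, new_sd=3)

# Hypothetical (mean, SD) of summed scores by age group (not from the Manual).
AGE_NORMS = {"younger adults": (90.0, 20.0), "older adults": (75.0, 20.0)}

def deviation_iq(total, age_group):
    """Deviation IQ: standard score with mean 100 and SD 15 within the age group."""
    mean, sd = AGE_NORMS[age_group]
    return standard_score(total, mean, sd, new_mean=100, new_sd=15)

print(scaled)
print(deviation_iq(75, "younger adults"), deviation_iq(75, "older adults"))
```

Under these made-up norms the same summed score of 75 is below 100 for the younger group and exactly 100 for the older group, which mirrors the pattern of the Manual example quoted above.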


THE WECHSLER-BELLEVUE SCALE 
IN THE SCHOOLS


The Wechsler-Bellevue Scale has been widely used in the 
individual study of adolescents and older students for whom the 
content of the Stanford-Binet is inappropriate. The test is most 
valuable, therefore, to teachers in the upper grades and in high 
schools and technical schools. 


Range and Stability of Wechsler-Bellevue IQ's 


The range of IQ's in the general school population is about the 
same for the Wechsler-Bellevue Scale as for the Stanford-Binet. 
Table 3-3 will serve, therefore, as a guide in the interpretation of 
a test score. Table 3-3 may be taken also as providing a statement 
of the educational expectations of older students when we know 
the Wechsler-Bellevue IQ. The reliability of the Wechsler- 
Bellevue Scale, as given by its standard error, is approximately 
five points. Hence, the IQ from this scale has about the same 
stability as the Stanford-Binet IQ. 


The Wechsler-Bellevue Scale in Diagnosis 


The Full Scale—like the Stanford-Binet—yields a measure of
a student's general mental level and is often used to provide this
information. The Wechsler-Bellevue, however, has also been
widely employed in mental hospitals and clinics for the diagnosis
of abnormal behavior. The Scale has been useful in the study of
variations of performance in schizophrenia and other mental
diseases, in senile deterioration, and in assessing the effects of brain
damage and the results of brain surgery. The fact that there are
eleven separate tests (six Verbal and five Performance) in the
Full Scale has led clinical psychologists to attempt to discover the
relative efficiency of various mental functions from irregularities
in test performance.

The diagnosis of differential abilities (strengths and weak- 
nesses) from the sub-tests of the Wechsler-Bellevue must always 
be taken as tentative, though an examination of the different sub- 
tests may provide valuable clues. The various tests of the Scale 
are too short and too complex (in that they test overlapping 
abilities) to allow a sweeping judgment to the effect that "Bill 
has poor planning capacity and poor judgment" or that "Mary 
has a good memory and adequate concentration." Observations 
of this sort are valuable only if made cautiously and taken in con- 
junction with other evidence. The Full Scale is a good index of 
present mental efficiency, and the difference between the Verbal 
and Performance IQ’s may be significant of the academic vs. the 
non-academic "mind" (page 75). But judgments drawn from 
specific sub-tests with respect to strengths and weaknesses in 
memory, learning, perception, planning capacity, concentration, 
emotional blocks, and the like must be taken as suggestive rather 
than conclusive.


WECHSLER INTELLIGENCE SCALE 
FOR CHILDREN 


Description. The WISC, as it is called, is a downward revision 
of the older Wechsler-Bellevue to render the test more suitable 
for young children. There are ten sub-tests and two alternates 
(twelve in all) in the WISC. The sub-tests have the same form 
and cover the same content as the Wechsler-Bellevue, except
that easier items have been added. Tests are grouped into five
Verbal and five Performance as follows:


Verbal Scale                  Performance Scale

General Information           Picture Completion
General Comprehension         Picture Arrangement
Arithmetic                    Block Design
Similarities                  Object Assembly
Vocabulary (Digit Span)       Coding (or Mazes)


The Wechsler Intelligence Scale for Children differs in several
respects from the Wechsler-Bellevue. In the Verbal Scale,
Digit-Span proved to be less satisfactory than the other tests and
hence became an alternate, Vocabulary being substituted. In
the Performance Scale, coding is a somewhat easier version of the
Digit-Symbol test. Mazes are sometimes given instead of coding,
but coding is usually preferred, since it takes less time
to administer. The maze test is the only test not found in the
Wechsler-Bellevue.


Scope. The WISC is a better made test than the Wechsler- 
Bellevue. To provide norms, one hundred boys and one hundred 
girls were tested at each age level from 5 to 15. Children in the 
standardization sample were drawn from eleven states and from 
three institutions for the feebleminded. The sample was carefully 
checked to give a cross section of geographic areas, urban-rural 
groups, and occupational levels of parents. 


Scoring. As was true of the Wechsler-Bellevue, all sub-tests
were first converted into standard scores in a distribution with
M = 10 and SD = 3. Tables are provided for reading scale
score equivalents to raw scores for each 4-month period from 5
to 15 years. These equally weighted sub-test scores are added
and then again converted into “deviation IQ's," with Mean =
100 and SD = 15 (page 39). Verbal, Performance, and Full
Scale IQ's may be read from appropriate tables in the Manual.
Approximately 50 per cent of school children can be expected
to earn WISC IQ’s between 90 and 110.
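
The figure of roughly 50 per cent can be checked against a normal curve. The Python sketch below assumes that deviation IQ's are normally distributed with Mean = 100 and SD = 15 (an idealization, not a claim about the standardization sample) and computes the share of cases in each band of the WISC classification given later in Table 3-4. The names are illustrative.

```python
import math

def normal_cdf(x: float, mean: float = 100.0, sd: float = 15.0) -> float:
    """Proportion of a normal distribution falling below x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def band_percent(low: float, high: float) -> float:
    """Per cent of cases with IQ's between low and high under the normal curve."""
    return 100.0 * (normal_cdf(high) - normal_cdf(low))

# Band limits follow the WISC classification (Table 3-4).
BANDS = [
    ("very superior", 130, math.inf),
    ("superior", 120, 130),
    ("bright normal", 110, 120),
    ("average", 90, 110),
    ("dull normal", 80, 90),
    ("borderline", 70, 80),
    ("mental defective", -math.inf, 70),
]

for label, low, high in BANDS:
    print(f"{label:18s}{band_percent(low, high):6.1f}")
```

On this assumption just under half of all cases fall between 90 and 110, and the other bands come out close to the whole-number percentages listed in Table 3-4.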


Differences Between the WISC and the Stanford-Binet. The
WISC differs from the Stanford-Binet in several important
respects. First, all items of a given sort in the WISC are organized
into sub-tests instead of different kinds of items being placed at
successive age levels. WISC is a point scale rather than an age
scale. Second, the WISC IQ is a deviation IQ—a standard score
in a distribution with Mean = 100 and SD = 15—whereas the
Stanford-Binet IQ is a developmental ratio or MA/CA. The two
IQ's are closely related (the correlations between the two sorts of
scores run from .80 upward), but they are not identical (page
66). The SD of the Stanford-Binet distribution of IQ's is 16, as
against the WISC SD of 15; and some of the difference between
the two IQ's is due to the greater spread of the Stanford-Binet
IQ's. Furthermore, the two mental examinations differ in length,
variety and difficulty of items. Finally, the WISC provides for
three IQ's—a Verbal, a Performance and a Full Scale. There is
only one IQ from the Stanford-Binet, based upon all of the tests
in the scale.
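
The part of the difference that is due to spread alone can be illustrated in Python. The sketch assumes, purely for illustration, that a child holds the same relative standing (the same number of SD's from the mean) on both tests; in practice the two examinations also differ in content, so this is not a conversion table, and the function name is hypothetical.

```python
def rescale_iq(iq: float, from_sd: float, to_sd: float, mean: float = 100.0) -> float:
    """Express the same standing (in SD units) on a scale with a different SD."""
    z = (iq - mean) / from_sd
    return mean + to_sd * z

# Stanford-Binet IQ's (SD = 16) and the matching standing on an SD = 15 scale.
for sb_iq in (68, 84, 100, 116, 132):
    print(sb_iq, rescale_iq(sb_iq, from_sd=16, to_sd=15))
```

Two SD's above the mean is 132 on a scale with SD 16 but 130 on a scale with SD 15, so the gap from spread alone widens toward the extremes and vanishes at 100.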


THE WISC IN THE SCHOOLS 


Both the WISC and the Stanford-Binet are widely used with
school children, and in most cases there is little to choose between
the two examinations. Many psychologists regard the
Stanford-Binet as more satisfactory for use with very young children,
since the WISC is not always easy to administer when the child
is under seven years old. WISC takes less time to give and to
score than does Stanford-Binet, and some examiners prefer it
over the age range of the elementary school. The WISC Full
Scale IQ has a higher correlation with Stanford-Binet IQ than
does either the Verbal or the Performance Scale IQ.

Bright children tend to score higher on the Stanford-Binet than
on the WISC, whereas dull pupils score higher on the WISC.
The separate IQ's of the WISC (Verbal and Performance) are
often valuable in bringing out differences in verbal and
manipulative skills. Sometimes a child (usually a boy) will do better on
the performance tests of the WISC than on the verbal, indicating,
perhaps, greater aptitude for vocational than for academic
subjects. A bookish youngster who reads a great deal may do much
better on the verbal tests. The performance IQ is usually higher
than the verbal in severely disturbed adolescents, and this
difference often appears also in younger dull students. From the
manner in which the child handles the verbal tests, the expert
examiner will often note evidences of insecurity as revealed by
incoherence, verbosity, poor attention, and defeatism. Poor
performance on the manipulative tests often reveals inept
planning and defective co-ordination, whereas good performance
shows concentration and adequate sensory-motor organization.


Range and Stability of the WISC IQ's 


The range of WISC Full Scale IQ's to be expected in the
general school population, and the meaning of these "scores," are
shown in Table 3-4.

TABLE 3-4

Intelligence Classification for WISC IQ's

IQ Ranges        Classification       Percent in Each Group
130 and above    very superior         2
120 - 129        superior              7
110 - 119        bright normal        16
 90 - 109        average              50
 80 -  89        dull normal          16
 70 -  79        borderline            7
 69 and below    mental defective      2


It will be seen that these classifications correspond closely to
those for Stanford-Binet IQ's.

Reliability coefficients for the WISC are generally above .90.
They are higher for the Full Scale than for either the Verbal or
Performance Scales. The standard error of a WISC IQ is 4-5
points.
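The classification bands and the standard error can be read together. The short Python sketch below is an illustration only: the function names are invented here, and the 5-point standard error is simply the upper figure quoted in the text, not a value taken from the WISC Manual.

```python
# Bands from Table 3-4: (lowest IQ in the band, classification).
WISC_BANDS = [
    (130, "very superior"),
    (120, "superior"),
    (110, "bright normal"),
    (90, "average"),
    (80, "dull normal"),
    (70, "borderline"),
]


def classify_wisc_iq(iq):
    """Return the Table 3-4 classification for a Full Scale IQ."""
    for floor, label in WISC_BANDS:
        if iq >= floor:
            return label
    return "mental defective"  # 69 and below


def iq_band(iq, standard_error=5):
    """Return (low, high): the obtained IQ minus and plus one standard error.

    The default of 5 points is an assumption for illustration.
    """
    return iq - standard_error, iq + standard_error


low, high = iq_band(112)
print(classify_wisc_iq(112), classify_wisc_iq(low), classify_wisc_iq(high))
```

An obtained IQ of 112 falls in the "bright normal" group, but a band of one standard error on either side (107 to 117) reaches down into "average." A score near a boundary should therefore be classified with caution.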


MA's from the WISC 


The WISC does not ordinarily make use of mental age, but
when mental ages are required for clinical or for legal reasons,
they can readily be determined. The Manual (Appendix E)
provides a table of “test-age equivalents to WISC raw scores." By
reference to the table, we find the chronological age of a child
for whom a given raw score is typical (or average), and this is
the MA corresponding to the score. For example, a score of 12
on the Comprehension test is achieved on the average by children
who are 10-6 years old. Hence, a score of 12 in Comprehension
has an equivalent MA of 10-6. The mean of the sub-test MA’s is
computed (Mean Test-Age Method) or the median of the MA’s
(Median Test-Age Method). Either of these determinations gives
the final over-all MA. A closely equivalent method for
determining MA's from the WISC is by use of the formula MA = IQ ×
CA, the IQ being written as a ratio (1.10 for an IQ of 110). A child
who achieves an IQ of 110 and who is 8-2 years old has a MA of
1.10 × 8-2, or approximately 9-0 years.
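The arithmetic is easiest to follow when ages are first turned into months. The Python sketch below is a minimal illustration, assuming ages are written in the "years-months" notation used in this chapter; the function names and the five sub-test test ages are hypothetical and do not come from the WISC Manual.

```python
from statistics import mean, median


def to_months(age):
    """Convert an age written 'years-months' (e.g. '10-6') to months."""
    years, months = age.split("-")
    return int(years) * 12 + int(months)


def to_age(months):
    """Convert months back to 'years-months', rounded to the nearest month."""
    total = round(months)
    return f"{total // 12}-{total % 12}"


def ma_from_iq(iq, ca):
    """MA = IQ x CA, with the IQ taken as a ratio (110 becomes 1.10)."""
    return to_age(iq / 100 * to_months(ca))


def ma_from_test_ages(test_ages, method="mean"):
    """Mean or Median Test-Age Method applied to sub-test test ages."""
    months = [to_months(a) for a in test_ages]
    return to_age(mean(months) if method == "mean" else median(months))


# The worked example from the text: IQ 110, CA 8-2.
print(ma_from_iq(110, "8-2"))  # 98 months x 1.10 = 107.8 months

# Hypothetical test ages for five sub-tests.
ages = ["10-6", "9-8", "11-0", "10-2", "9-10"]
print(ma_from_test_ages(ages, "mean"), ma_from_test_ages(ages, "median"))
```

For the child in the text the product is 107.8 months, which rounds to 9-0. With the hypothetical test ages the mean method gives 10-3 and the median method 10-2, which shows how closely the two determinations usually agree.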


PERFORMANCE TESTS 
Development of Performance Tests 


Performance tests designed to measure general mental ability
have been often used in the schools (1) as substitutes for the
more verbal tests, and (2) as supplements to the Stanford-Binet
and other linguistic scales. Performance and non-language tests
must of necessity be employed with pre-school children and with
the very dull. Such tests are useful additions to the Stanford-Binet
or WISC in the mental examination of children with speech
and language defects or children with visual and auditory
impairment. Batteries of performance tests have long been used in
psychological clinics and in institutions for the feebleminded.
The classroom teacher should know about performance tests,
though he will encounter them much less often than the WISC
or the Stanford-Binet.

The Pintner-Paterson Scale of Performance Tests (1917) was
the first organized battery of manipulative and non-language
tests. Widely used for many years, this scale has now been
replaced to a considerable degree by other batteries based upon
it. The Pintner-Paterson Scale consists of fifteen separate tests.
The ten tests most often used (in what is called the Shorter
Scale) include four form boards, three picture completion tests
(of the jigsaw puzzle type), two object assembly tests, and
one block-counting test.

Later performance scales are the Cornell-Coxe Performance 
Ability Scale (1934) and the Arthur Point Scale of Performance 
Tests (1930, revised 1947). These test batteries draw heavily on 
the Pintner-Paterson, but include, too, important additions and 
revisions. In addition to these test batteries, there are a number 
of other performance tests, of which a graduated series of mazes, 
the Porteus Mazes, is the best known. Widely used types of per- 
formance tests are the object assembly (page 65), various form 
boards, block counting, and block design. Two of these, block 
design and object assembly, are found in the Wechsler-Bellevue 
Scale. 

Norms are generally available for the individual performance
tests, so that one may use one or more tests without having to
administer the whole scale.


The Arthur Scale * 

The Arthur Point Scale has been widely used over the age
range of the elementary school. It is made up of performance
tests taken from various sources; it was first published in 1930
and revised later in 1947. The later edition is a considerable
improvement over the original insofar as standardization is
concerned, and the Scale is a good example of a performance battery
designed for children. Figure 3-5 (facing page 53) shows the five
tests of the Arthur Point Scale.

There are five tests in the Arthur Point Scale, as follows:


Knox Cube. The four cubes (see Figure 3-5) are tapped in a
certain order by the examiner, for example, cube 1, cube 4,
cube 2, cube 3. The child is told to imitate the tapping order.
Tapping sequences become longer and more complex, until
they can no longer be done by the child.

Seguin Form Board. Ten common geometric forms (Figure 3-5)
are to be fitted into the right apertures in the board.

Porteus Mazes. The child is told to trace the shortest path from
the entrance to the exit in a maze, not lifting the pencil from
the paper. If he makes an error by crossing a line or entering
a wrong pathway, he is stopped and given a second trial. Mazes
increase in difficulty from 3-year level to adult.

Healy Picture Completion II. As shown in Figure 3-5, the test
shows successive scenes in a boy’s life during a typical school
day. Small pieces or blocks have been cut out of the scene. The
child must select the appropriate pieces from the box and fit
them in place.

Arthur Stencil Design Test. The child must reproduce designs 
of increasing complexity. Standard designs to be copied are 
presented on cards. Each design can be reproduced by fitting 
together stencils in different colors on a solid white card. 
Several stencils are needed for the more detailed designs. 


Scope. The Arthur Performance Scale covers an age range 
from about 4 years to maturity, but is used chiefly with younger 
children. The Scale is employed mainly as a clinical test supple- 
mentary to, or as a substitute for, the Stanford-Binet. 


Scoring. Scores on the sub-tests (based on accuracy and time)
are first converted into point scores. These are combined and
converted into mental ages. MA’s are chronological ages which
are typical for given combined scores. Thus if the average child
of 10-0 scores 31 points, a score of 31 points becomes an MA of
10-0. MA is divided by CA to give a "performance IQ." These
IQ’s are not equivalent to the IQ’s from verbal intelligence
scales and are not to be so regarded. Arthur Scale IQ's should
always be described as “Arthur Scale IQ's.”
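The two steps (point score to MA, then MA divided by CA) can be sketched in a few lines of Python. This is an illustration under stated assumptions: only the 31-point entry (MA 10-0, or 120 months) comes from the text, the other rows of the norms table are invented so that the lookup has something to work with, and the multiplication by 100 follows the usual convention for writing a ratio IQ.

```python
# Point score -> MA in months. Only the 31-point row (MA 10-0) is taken
# from the text; the remaining rows are hypothetical.
HYPOTHETICAL_NORMS = {27: 108, 29: 114, 31: 120, 33: 126}


def performance_iq(point_score, ca_months, norms=HYPOTHETICAL_NORMS):
    """Look up the MA for a combined point score and return 100 x MA / CA."""
    ma_months = norms[point_score]
    return round(100 * ma_months / ca_months)


print(performance_iq(31, 120))  # a child of 10-0 who earns the typical score
print(performance_iq(31, 108))  # a child of 9-0 who earns the same score
```

A child of 10-0 who earns 31 points has MA equal to CA and so an Arthur Scale IQ of 100; a child of 9-0 with the same score has an Arthur Scale IQ of 111.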


Performance Tests in the Schools 


Correlations between scores on the Arthur Scale and the
Stanford-Binet are fairly high (.50 or more). The two tests are,
however, not measuring exactly the same functions, and hence
the Arthur IQ is often used as a “performance supplement” to
the Stanford-Binet IQ. Arthur IQ's are higher than Stanford-Binet
IQ's when the latter are low, that is, below 90; and this
discrepancy is especially striking when children are very dull.
There is evidence that low performance test scores may be
indicative of behavior problems and of emotional instability. This
result probably grows out of the disturbed child's poor attention
span, poor perception of relations and ineptitude in manual
activities. Emotional involvement may take expression in bizarre and
unusual responses.

For the classroom teacher the main value of a performance 
test lies, perhaps, in the fact that such tests (1) may reflect poor 
language development or lack of language training, and are (2) 
often indicative of cultural and educational handicaps. As 
pointed out on page 68, a comparison with verbal tests often 
reveals, for instance, children whose manual and manipulative
skills (“concrete intelligence") run ahead of their verbal facility 
(“abstract intelligence"). Performance tests serve, too, to identify 
the shy and inarticulate child who is brighter than the verbal 
tests show. Performance tests are not especially useful with 
normal school children over 12 years of age and they rarely 
differentiate significantly among older bright children. 


Case Histories. The following brief case histories will illustrate
how performance tests, when used together with verbal tests,
may provide a better understanding of a pupil's capabilities.
Recommendations in most cases must be tentative and subject to
possible revision in the light of further information.


Case I. Donald B.: age, 10-2; Stanford-Binet IQ, 92; Arthur
Scale IQ, 106.

Donald is doing poor work in the fifth grade. His father is a
barber, his mother a clerk (part-time) in a store; neither parent
went beyond the seventh grade. There are three other children
in the family, all younger than Donald. There are few books in
the home, but the family owns a TV set and a new automobile.
Donald reads the sports page and the comics in the daily
newspaper, but little else. He talks in brief sentences and is generally
unresponsive in school. He is a well-grown boy for his age, a
good athlete, and is well accepted by his classmates. He has
never been a behavior problem.

Recommendation: Donald’s performance IQ is fourteen points 
higher than his Stanford-Binet IQ. In view of his relatively 
meager abstract intelligence, this boy is probably doing as well 
as we can expect. He may get to high school, but will almost 
certainly not complete more than one year. Vocational training 
seems to be indicated. He will continue to have trouble with 
verbal subjects, but may be very successful at a skilled trade. 


Case II. Joan M.: age, 8-3; Stanford-Binet IQ, 126; Arthur Scale 
IQ, 109. 

Joan is doing excellent work in the fourth grade. Her problem 
is social rather than scholastic. Her father is dead, and her mother, 
a widow, is a successful dress designer. Joan is an only child and 
is alone much of the time. She reads a great deal but has few close 
friends and is often left out of class activities. She has a tendency 
to daydream and is shy and withdrawn. 

Recommendation: Joan’s low performance IQ, coupled with
her high Stanford-Binet IQ, indicates a lack of experience with
“concrete” activities, such as running, playing out-of-doors,
skipping rope, dancing, and the like. This lack of opportunity to
develop manual skills is often found in children reared in a large
city. Joan's mother may be encouraged to meet with other
classroom mothers, arrange parties, and invite Joan’s classmates to
her home. The aid of the physical education teacher in getting
Joan into games should also be sought. The classroom teacher
can often see to it, by suggestion and indirection, that Joan is
included in class parties and out-of-class activities.


Case III. Bob W.: age, 9-2; Stanford-Binet IQ, 104; Arthur Scale 
IQ, 115. 

Bob is doing satisfactory work in the fourth grade. He is shy
and timid, with a slight tendency to stammer, especially when
questioned. He is one of four children, the other three being girls
all older than he. Bob's father is a successful lawyer, and his
mother is a college graduate and a prominent club woman. The
parents have decided that Bob, as the only boy, is to be a
professional man, preferably a physician (his grandfather was a
well-known surgeon). They are dissatisfied with Bob’s marks, and
are sure he is intelligent and that the teacher is to blame.

Recommendation: Bob is clearly a normal boy. He is not 
bright, though he probably is brighter than his Stanford-Binet 
IQ indicates. The parents must somehow be reconciled to the
fact that (1) Bob is not of professional caliber, and (2) a lower 
vocational goal (one within Bob’s intellectual grasp) will make 
for a happier boy and probably a much happier life. They must 
be urged not to scold the boy and thus make him feel more
inferior than he already does. This is a difficult problem, because 
it is the parents—not the child—who have to be “sold” on a 
different program from the one they have planned. 


SUGGESTIONS FOR FURTHER READING 


General:

Anastasi, A. Psychological Testing. New York: Macmillan, 1954.

Cronbach, L. J. Essentials of Psychological Testing. New York: Harper,
1949.

Freeman, F. S. Theory and Practice of Psychological Testing (Rev.
Edition). New York: Holt, 1955.

Specific:

Arthur, G. A. A Point Scale of Performance Tests. Revised Form II.
Manual for Administering and Scoring the Tests. New York:
Psychological Corp., 1947.

McNemar, Q. The Revision of the Stanford-Binet Scale: An Analysis
of the Standardization Data. Boston: Houghton Mifflin, 1942.

Terman, L. M., and Merrill, M. A. Measuring Intelligence. Boston:
Houghton Mifflin, 1937.

Wechsler, D. The Measurement of Adult Intelligence (3rd edition). 
Baltimore: Williams and Wilkins, 1944. 

Wechsler, D. Wechsler Intelligence Scale for Children. Manual. New 
York: Psychological Corp., 1949. 


SUGGESTIONS FOR LABORATORY WORK 


1. Examine the Stanford-Binet items at ages 4, 8, 12, and Superior 
Adult. Classify the items at each age level as verbal, numerical, spatial- 
perceptual (for example, mazes and the like), and performance (manip- 
ulative). Add other categories if you need them. Which category has 
the largest number of items? 

2. Have members of the class pair off and test each other. Be sure to
follow the Manual carefully. Results from this "test" will not be
indicative of mental ability, to be sure, but following the procedure is a
good way to learn about the test.

3. Repeat (1) and (2) for the Wechsler Intelligence Scale for Children.
For (1), sample the items of each test.

4. Go over the Manual of the Arthur Point Scale. If materials are 
available, administer the Scale to a child before the class. 


QUESTIONS FOR DISCUSSION


1. What importance do you attach to the fact that test items in
Stanford-Binet become more "verbal" as we go up the age scale?

2. Which test, Stanford-Binet or Arthur Point Scale, would you expect
to prove more effective in the following situations:
a) selecting children for a special class for the gifted
b) selecting children for remedial work in a "slow" class
c) studying children with reading problems
d) testing children with speech defects

3. A child taken from public school and entered in a private school
is reported by his mother to have shown an increase in IQ of 20 points
after six months in the "new" school. Assuming the story to be true,
what is misleading about it? What might account for the change in the
IQ?

4. A high school boy of 16 has a Wechsler-Bellevue IQ of 132. What
advice would be justified by this fact alone?

5. Look over the items in the WISC. Which do you think depend
primarily on schooling? Do the same for the Stanford-Binet. Which test
is the more “school centered”?

6. Terman states that the vocabulary test gives the closest
approximation to total performance on Stanford-Binet. What does this tell us
about the nature of the Stanford-Binet IQ?

7. In deploring the reading interests, TV programs, and voting habits
of the American adult, critics have said that the average mental age of
the adult is about 14 (sometimes this is 12 or 15). What does mental age
signify here, if anything definable?

8. Does a child with an IQ of 80 possess 80 per cent of normal
intelligence? Explain your answer.




CHAPTER 4


GROUP TESTS OF INTELLIGENCE 


Group and Individual Tests of Intelligence 


Group tests of intelligence are much like individual tests except
that (1) they are administered like school examinations, and (2)
they are objective in form—are answered by checking or circling 
a number or letter, or by marking one of several possible responses.
Group tests contain both verbal and non-verbal materials. Items of
the first sort are expressed in words and numbers; non-verbal test
items, on the other hand, consist of problems presented in pictures
and diagrams. There is a minimum of language and little or no
reading required in non-verbal items. Intelligence tests for
pre-school and first-grade pupils are of necessity non-verbal, though
directions are given orally. Intelligence tests in the elementary
grades contain both verbal and non-verbal items. At the high school
and college levels, test items are mostly verbal, mathematical, and
abstract, but even here many problems are presented in pictorial
and spatial forms.

Group tests of intelligence confront the examinee with tasks
like those found in the individual intelligence scales. Both types
of test minimize routine school learning and emphasize mental
alertness by presenting problems which demand reasoning,
generalization, and the manipulation of “ideas.” But there are
differences, too, between the two sorts of test. In individual
intelligence scales, questions are stated orally and are answered orally;
moreover, problems are presented one at a time without time
limit, or a generous limit is allowed. In group intelligence tests,
questions are printed in a booklet, time limits are fixed, and
answers are limited to the options provided. The group test is
more dependent on reading than is the individual test, it is less
flexible in response, and it is often disturbing to children who are
easily flustered by a time limit. When a child's school work
and/or the teacher’s opinion of his abilities do not jibe with his
group test score, it may be advisable to check the group test
result against the Stanford-Binet. Group tests, like individual
scales, are concerned almost entirely with the abstract level of
intelligence (page 46).

The first group tests to be widely used were the two
intelligence examinations developed for use in the army during World
War I (1917-1918). Army Alpha consisted of eight sub-tests:
Following Directions, Arithmetic Problems, Best Answers,
Disarranged Sentences, Same-Opposites, Number Series Completion,
Analogies, and Information. Army Beta made use of diagrams
and pictures, and directions were given in pantomime. During
World War II, the Army General Classification Test (AGCT)
was developed as a measure of general ability. Unlike Alpha,
the items in AGCT were not grouped into sub-tests, but were
printed in ascending order of difficulty. A civilian edition of
AGCT is now available.


REPRESENTATIVE GROUP TESTS OF INTELLIGENCE


This section will describe several tests of general intelligence
covering the age range from pre-school to college. These test
batteries have been chosen for illustration because they are well
standardized, are widely used in the schools, and are
representative of a large assortment of group tests designed to measure
general ability. They are not necessarily the best mental
examinations for every testing situation nor for every school. Selection
of a "best" test will depend on the objectives which the school
hopes to achieve, the time available for testing, and the money
and personnel which the school has available.


GROUP TESTS OF INTELLIGENCE

Pintner-Cunningham Primary Test
California Test of Mental Maturity
Otis Quick-Scoring Mental Ability Tests
Kuhlmann-Anderson Intelligence Tests
Terman-McNemar Test of Mental Ability
American Council on Education Psychological Examination


1. The Pintner-Cunningham Primary Test* 


Description. This test includes seven non-verbal sub-tests
described as follows:


1. Common Observation: Child marks all of the objects in a
given set which fit into some category. (See Figure 4-1, row 1.)

2. Aesthetic Differences: The child is told to mark the 
“prettiest” (that is, best) of three drawings of the same object. 
(Figure 4-1, row 2.) 

3. Associated Objects: The child marks the two objects that 
belong together in each row of pictures—as, for example, the 
hat and the coat. (Figure 4-1, row 3.) 

4. Discrimination of Size: The pupil is instructed to mark the
items of clothing which are of the right size for the individual
pictured. For each article of clothing (shoes, hat, gloves, etc.)
one is too large, one is too small, and one is of the right size.

5. Picture Parts: In this test a series of pictures of increasing
complexity is shown. These contain children, toys, animals, and
other items. The same items are shown outside the "standard"
picture, mixed in with other objects. The child is instructed to
mark all of the objects in this group which appeared in the
picture.

6. Picture Completion: In each incomplete picture, the pupil
is asked to locate and mark the correct missing part from among
several parts shown.

7. Dot Drawing: The child is to copy drawings which are
formed by joining dots. See Figure 4-1, row 4.

All the tests are non-verbal, since most of the children for
whom the test is intended have not learned to read. Directions
are given orally.


FIGURE 4-1 Illustrative Items from the Pintner-Cunningham
Primary Test

Test 1. Mark the things that Mother uses when she sews her apron.
Test 2. Mark the prettiest house.
Test 3. Mark the two things that belong together.
Test 7. Look at each picture. See how it is drawn. Make another one
like it in the dots.

Reproduced by permission of the World Book Company.


* Published by the World Book Company, Yonkers-on-Hudson, N. Y.


Scope. The Pintner-Cunningham Primary Test covers the
kindergarten, Grade I, and the first half of Grade II. There are
three equivalent forms, A, B, and C.


Scoring. Scores from the seven sub-tests are combined to give 
a total point score. Mental ages corresponding to point scores 
may be read from tables in the Manual. Pintner-Cunningham
MA's are chronological ages for which the given point scores
are typical (see page 33). These MA’s are divided by the child's
CA to obtain an IQ. An alternate, and better, procedure is to
convert the point scores into deviation IQ's, following the 
method used in the Wechsler-Bellevue. The mean IQ is, of 
course, 100 and the SD is 16, equal to that of the Stanford- 
Binet. Pintner-Cunningham IQ's are not strictly equivalent to 
Stanford-Binet IQ’s, though the same abilities appear to be 
measured by the two scales. Correlations between the two test 
batteries run from .70 to .90 for kindergarten and primary school 
children. This indicates that Pintner-Cunningham is a valid meas- 
ure of the abstract intelligence measured by Stanford-Binet. The 
reliability or stability of Pintner-Cunningham scores is high, as 
shown by the close correspondence of one form with another. 
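A deviation IQ of the kind mentioned above places a point score on a scale with mean 100 and SD 16 by way of the child's standing in his own age group. The Python sketch below is illustrative only: the age-group mean of 40 points and SD of 8 points are invented figures, and the function name is hypothetical; actual conversions would be read from the tables in the Manual.

```python
def deviation_iq(point_score, age_group_mean, age_group_sd,
                 iq_mean=100, iq_sd=16):
    """Convert a point score to a deviation IQ.

    The score is first expressed as a standard score (distance from the
    age-group mean in SD units) and then moved to a scale with mean 100
    and SD 16, the values given in the text.
    """
    z = (point_score - age_group_mean) / age_group_sd
    return round(iq_mean + iq_sd * z)


# Hypothetical age group: mean 40 points, SD 8 points.
for score in (36, 40, 48):
    print(score, deviation_iq(score, 40, 8))
```

A child one SD above his age-group mean (48 points here) receives a deviation IQ of 116, a child at the mean receives 100, and a child half an SD below the mean receives 92.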


2. The California Test of Mental Maturity (CTMM)* 


Description. These tests contain both verbal and non-verbal
materials. Sub-tests are grouped under the following five heads:
memory, spatial relations, logical reasoning, numerical reasoning,
and verbal concepts. Each of these categories is represented by
from two to four tests. The profile in Figure 4-2 gives the names
and classification of these sub-tests. The first three tests in each
California battery are designed to measure visual acuity, auditory
acuity, and motor co-ordination. These tests, which are in a
separate booklet, are rough screening devices intended to
identify children too handicapped to be correctly classified by
the test battery.


FIGURE 4-2 Profile for the California Test of Mental Maturity


* Published by the California Test Bureau, Los Angeles, Calif.


Scope. The California test series covers the range from kinder- 
garten to college. The five batteries are as follows: 
1. Pre-primary level: kindergarten and first grade 
2. Primary level: grades 1-3 
3. Elementary level: grades 4-8
4. Junior high level: grades 7-9
5. Advanced level: grades 10-college and adult.
These test batteries require about 1½ hours working time. They
are relatively easy to administer and to score. 


Scoring. Separate scores are obtained for each of the five areas
(called “factors”) into which the sub-tests have been grouped.
There are also scores (and mental ages) based on (1) the language
or verbal tests alone, (2) the non-language tests alone, and
(3) the test as a whole. From these three scores, separate MA’s
may be read from tables in the Manual. Language, non-language,
and total-test IQ’s are found by dividing the appropriate MA
by the child’s CA. Percentile ranks are also provided for each of
the five “mental factors.” These PR's may also be read from
appropriate tables.
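In practice percentile ranks are read from the publisher's tables, but the idea behind them can be shown with a small Python sketch. It assumes the plain definition used in this chapter (the per cent of the age group falling below the child); the ten norm-group scores and the function name are invented for illustration, and published tables may treat tied scores differently.

```python
def percentile_rank(score, norm_scores):
    """Per cent of the norm group scoring below the given score."""
    below = sum(1 for s in norm_scores if s < score)
    return round(100 * below / len(norm_scores))


# Hypothetical factor scores for ten children of the same age.
age_group = [12, 15, 18, 20, 21, 23, 25, 27, 30, 34]

print(percentile_rank(27, age_group))  # seven of the ten score lower
print(percentile_rank(23, age_group))  # five of the ten score lower
```

With these figures a score of 27 has a PR of 70 and a score of 23 has a PR of 50.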

A special feature of the CTMM is the use of a profile or chart 
as an aid in analysis and diagnosis. As shown in Figure 4-2, the 
highs and lows of a pupil’s performance in the five areas may be 
readily seen from their positions on the profile. Along the right- 
hand margin of the chart, percentile ranks (PR’s) are entered 
for each factor, as well as for total score and for the language 
and non-language parts of the test. These PR’s give the student’s
standing on a scale of one hundred points (page 33). If the PR
is 50, the child stands just in the middle of his age group; if his
PR is 70, then 70 per cent of his age group fall below him in

The validity of the CTMM was determined through its
correlations with Stanford-Binet and other standard mental tests.
The tests appear to be very homogeneous (to measure the same
abilities) over the age range from pre-school to college. The
reliability of the language, non-language, and total scores is high:
reliability coefficients of these part scores and of the factor scores
range from .87 to .95 over grades 4-6.


3. Otis Quick-Scoring Mental Ability Tests* 


Description. These tests differ from many group tests of
intelligence in that the test items are not grouped into separate
sub-tests according to type of item. Instead, the different items
(analogies, arithmetic problems, opposites and the like) are printed
in a continuous repetitive pattern, so that items of a certain sort
(opposites, for example) follow each other at stated intervals.
This arrangement is sometimes called a “scrambled” test, or more
precisely a spiral omnibus arrangement. Items are progressively
more difficult from the start to the finish of the test.
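The spiral omnibus pattern can be pictured as dealing items out in rounds, one of each type per round, with each type's items already ordered from easy to hard. The Python sketch below is only an illustration of that pattern; the item labels and the function name are hypothetical, and it makes no claim about how the Otis tests were actually assembled.

```python
from itertools import zip_longest


def spiral_omnibus(items_by_type):
    """Interleave item types in rounds.

    items_by_type maps each item type to its items, already sorted from
    easy to hard. Each round takes the next item of every type, so a
    given type recurs at a fixed interval and difficulty rises steadily.
    """
    rounds = zip_longest(*items_by_type.values())
    return [item for rnd in rounds for item in rnd if item is not None]


# Hypothetical item pool: three types, three difficulty levels each.
pool = {
    "analogies": ["A1", "A2", "A3"],
    "arithmetic": ["R1", "R2", "R3"],
    "opposites": ["O1", "O2", "O3"],
}

print(spiral_omnibus(pool))
```

In the resulting order (A1, R1, O1, A2, R2, O2, and so on) an opposites item appears at every third position, and each later round is harder than the one before.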

The following items are like those in the Otis Beta Test,** an
examination prepared for grades 4-9.

1. Which of the five things below is soft?
   1. glass  2. stone  3. cotton  4. iron  5. ice

2. A robin is a kind of
   1. plant  2. bird  3. worm  4. fish  5. flower

3. Hat is to head as shoe is to
   1. arm  2. leg  3. foot  4. fit  5. glove

4. North
   1. hot  2. east  3. west  4. down  5. south

5. At five cents each, how many pencils can be bought for 40 cents?
   1. 45  2. 9  3. 200  4. 5  5. 12


* Published by the World Book Company, Yonkers, N. Y.
** The first two items are samples from the Beta Test. Other items are like
those in the test.

Scope. The Otis Quick-Scoring Tests cover the age range from
Grade I through college. There are three batteries, as follows:
Alpha Test (90 items) non-verbal; grades 1-4
Beta Test (80 items) verbal, numerical, and spatial; grades 4-9
Gamma Test (80 items) verbal primarily; high school and
college


Scoring. The Otis tests are easy to administer, and scoring is
facilitated by a cutout stencil which can be superimposed on the
test booklet. The tests are virtually "self-administering." There
is a single time limit, which varies from twenty to thirty minutes.
Mental age equivalents to total score are read from tables in the
Manual. The Otis IQ's are deviation scores and are measures
of brightness. These IQ’s are only generally comparable to
Stanford-Binet IQ's; the two “scores” are not equivalent. The
reliability of the Otis tests is high.


4. Kuhlmann-Anderson Intelligence Tests* 


Description. This is a series of thirty-nine separate sub-tests
grouped into nine overlapping test batteries. The sub-tests include
verbal and non-verbal materials. The early levels are entirely
pictorial, but the tests become more verbal and abstract as we
go up the age scale and finally are entirely verbal. Each test
battery consists of ten sub-tests.


Scope. Each of the nine batteries is printed in a separate booklet
and is designed to cover one or more grade levels, as follows:

Kindergarten    sub-tests  1-10
Grade 1         sub-tests  4-13
Grade 2         sub-tests  8-17
Grade 3         sub-tests 12-21
Grade 4         sub-tests 15-24
Grade 5         sub-tests 19-28
Grade 6         sub-tests 22-31
Grade 7-8       sub-tests 25-34
Grade 9-12      sub-tests 30-39

Administration of the K-A tests is somewhat more difficult than
with the Otis, since the tests in the batteries are often separately
timed. K-A requires from 30 to 45 minutes to administer.


* Published by the Personnel Press, Inc., 180 Nassau Street, Princeton, N. J.


Scoring. In setting up a scoring plan, the authors of K-A have
employed what is called the “median mental age” method. This
may be described briefly as follows. Each of the ten sub-tests
in a battery yields a mental age. These MA’s (see page 32) are
chronological ages for which a given score is typical or average.
Thus if the children who are 10 years and 2 months old earn
in general a score of 21 on a given sub-test, then the score of 21
corresponds to or is equivalent to a MA of 10-2 on this sub-test.
MA's are read from tables in the Manual. The median MA
is the median of the ten sub-test MA’s.* This is taken to be the
most representative measure of a child’s over-all ability.

An IQ for the battery is found by dividing the median MA 
by the child’s life age, or CA. This IQ is not equivalent to the 
Stanford-Binet IQ, though it is related to it. The K-A tests 
measure verbal or abstract intelligence primarily, especially at 
the upper age levels. The reliability of the K-A—as shown by 
the stability of its test scores—is very high. Reliability coefficients 
have been computed for grades 1 through 9 separately: these range
from .89 to .97.
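The median mental age method can be followed step by step in a short Python sketch. The ten sub-test MA's (in months) and the child's CA below are hypothetical, as is the function name, and the multiplication by 100 follows the usual convention for writing a ratio IQ; real MA's would be read from the tables in the Manual.

```python
from statistics import median


def ka_median_ma_and_iq(subtest_ma_months, ca_months):
    """Return the median sub-test MA (in months) and the resulting IQ.

    With ten MA's the median lies midway between the fifth and sixth
    values when the MA's are arranged in order of size.
    """
    median_ma = median(subtest_ma_months)
    return median_ma, round(100 * median_ma / ca_months)


# Hypothetical MA's for the ten sub-tests of one battery.
mas = [110, 114, 116, 118, 120, 122, 124, 126, 130, 134]

print(ka_median_ma_and_iq(mas, 116))  # CA of 9-8, or 116 months
```

Here the median MA is 121 months (10-1), and for a child of 9-8 the IQ is 104. One unusually high or low sub-test MA leaves the median almost unchanged, which is the reason for preferring it as the representative value.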


5. Terman-McNemar Test of Mental Ability** 


Description. This test is designed for high school students. It is
a measure largely of ability to read and comprehend fairly
difficult prose. Two numerical sub-tests found in an earlier edition
of the test were eliminated in order to render the test more
unified in content. As it stands, we have a highly verbal battery.
There are seven sub-tests, described as follows: information,
synonyms, logical selection, classification, analogies, opposites,
and best answer. Sample items and instructions for each item
type are shown in Figure 4-3. These items are easier than are
the items found in the test proper and are for illustration. Items
in the test are graded in difficulty from easy to hard.


* When ten scores are arranged in order of size, the point (or score) found
by counting off five scores from either end of the series is the typical value or
median (see page 20).

** World Book Company, Yonkers-on-Hudson, N. Y.


FIGURE 4-3 Sample Items from the Terman-McNemar Test
of Mental Ability (Form C)


TEST 1. INFORMATION

Mark the answer space which has the same number as the word that makes the sentence TRUE.

SAMPLE. Our first President was
1 Adams 2 Washington 3 Lincoln 4 Jefferson 5 Monroe


TEST 2. SYNONYMS

Mark the answer space which has the same number as the word which has the SAME or most nearly
the same meaning as the beginning word of each line.

SAMPLE. correct: 1 neat 2 fair 3 right 4 poor


TEST 3. LOGICAL SELECTION

Mark the answer space which has the same number as the word which tells what the thing ALWAYS
has or ALWAYS involves.

SAMPLE. A cat always has
1 kittens 2 spots 3 milk 4 mouse


TEST 4. CLASSIFICATION

In each line below, four of the words belong together. Pick out the ONE WORD which does not
belong with the others, and mark the answer space bearing its number.

SAMPLES. 1 dog 2 cat 3 horse 4 chicken 5 cow
6 hop 7 run 8 stand 9 skip


TEST 5. ANALOGIES

Study the samples carefully.

Ear is to hear as eye is to
1 [illegible] 2 [illegible] 3 [illegible] 4 wink 5 see
Hat is to head as shoe is to
6 [illegible] 7 leg 8 foot 9 [illegible] 10 glove

DO THEM ALL LIKE THE SAMPLES


TEST 6. OPPOSITES

Mark the answer space which has the same number as the word which is OPPOSITE, or most nearly
opposite, in meaning to the beginning word of each line.

SAMPLE. north: 1 hot 2 east 3 west 4 down 5 south


TEST 7. BEST ANSWER

Read each statement and mark the answer space which has the same number as the answer which
you think is BEST.

SAMPLE. We should not put a burning match in the wastebasket because
1 Matches cost money. 2 We might need a match later.
3 It might go out. 4 It might start a fire.


Reproduced by permission of the World Book Company. 
Scope. The Terman-McNemar is planned specifically for
grades 7 through 12 and for college freshmen.


Scoring. Total raw score is converted into a scaled score IQ,
which is closely related to Stanford-Binet IQ. Scores may also
be expressed as MA’s and as percentile ranks. The working time
for the test is about 50 minutes. Terman-McNemar comes in
two equivalent forms. In the construction of the test, a careful
item analysis was made (page 214) in order to weed out unsatisfactory
items. This is offered as evidence of the test's validity.
The reliability of the Terman-McNemar is reported to be .90
for a single age level.

6. American Council on Education Psychological Examination*

Description. This battery of tests is designed to measure scholastic
aptitude, or learning ability in school. It comes in two
forms, one for high-school students and another for college
freshmen. The college test consists of six sub-tests, as follows:


1. Arithmetic problems: 20 problems in multiple-choice form, 
of the “mental arithmetic” variety. 
2. Completion: 30 items in multiple-choice form. The test 
demands word knowledge and definitions. 
3. Figure Analogies: 30 multiple-choice items. Analogies involve
geometric forms, areas, angles, spatial arrangements.
4. Same-Opposite: 50 multiple-choice items which demand
vocabulary and word knowledge.
5. Number Series: 30 items to be completed "logically" with
appropriate numbers.
6. Verbal Analogies: 40 items: relation-finding in verbal terms.


* Published by the Educational Testing Service, Princeton, N. J.

In the ACE, college form, sub-tests 1, 3, and 5 are combined to
give a quantitative, or Q, score; sub-tests 2, 4, and 6 are combined
to give a linguistic, or L, score. Each sub-test is separately timed
and each is preceded by a practice exercise. In the high-school
form of the ACE, tests 3 and 6 are dropped, leaving four sub-tests.
Completion and same-opposite are combined to give the L
score, and arithmetic and number series to give the Q score.
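The grouping of sub-tests into Q and L scores can be sketched in Python. The text says only that the sub-tests are "combined"; simple addition of raw scores is an assumption here, and the dictionary keys, function name, and sample scores are hypothetical.

```python
# Which sub-tests feed each part score, following the description above.
COLLEGE_FORM = {
    "Q": ["arithmetic", "figure_analogies", "number_series"],
    "L": ["completion", "same_opposite", "verbal_analogies"],
}
HIGH_SCHOOL_FORM = {
    "Q": ["arithmetic", "number_series"],
    "L": ["completion", "same_opposite"],
}


def ace_scores(raw: dict, form: dict) -> dict:
    """Sum raw sub-test scores into Q, L, and total (assumed additive)."""
    scores = {part: sum(raw[name] for name in names)
              for part, names in form.items()}
    scores["total"] = scores["Q"] + scores["L"]
    return scores


# Hypothetical raw scores for one student.
raw = {"arithmetic": 14, "completion": 22, "figure_analogies": 20,
       "same_opposite": 35, "number_series": 18, "verbal_analogies": 28}

print(ace_scores(raw, COLLEGE_FORM))      # {'Q': 52, 'L': 85, 'total': 137}
print(ace_scores(raw, HIGH_SCHOOL_FORM))  # {'Q': 32, 'L': 57, 'total': 89}
```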


Scope. The ACE is the most difficult of the general intelligence 
tests described so far. Testing time varies from about forty min- 
utes (high school form) to sixty minutes (college form). 


Scoring. The three scores from the ACE (the quantitative, the
linguistic, and the total) may be converted into percentile ranks.
Extensive norms (in PR’s) are published annually covering test
results from previous years. Q scores have been found to correlate
with achievement (grades) in mathematics and science,
but the L score has the higher correlation with general achievement
in high school.

The predictive validity (page 31) of the ACE, as determined
over several years, is high. ACE correlates from .40 to .60 with
college grades; its correlations with Stanford-Binet average about
.65. The reliability coefficients of Q, L, and total score are all
very high. One feature of the ACE is the publication of norms
for different groups. Separate norms are available for boys and
girls and for three types of college: 4-year, 2-year (junior),
and teachers’ colleges. Although the 4-year colleges achieve
higher mean scores, there is much overlapping of 4-year, 2-year,
and teachers’ college scores. Differential norms are a distinct aid
to educational counselors.


HOW GROUP INTELLIGENCE TESTS ARE USED 
IN THE SCHOOL 


Survey Measures 


In general, the group test of intelligence is used (1) to give 
an over-all measure of a child’s abstract ability (often an IQ), 
(2) to provide a basis for educational counseling and guidance, 
and (3) to give a basis for prognosis. The total score on a group 
test is useful to the school administrator, the classroom teacher, 
and the parent. Standard tests supply the school administrator
with a systematic record of how different schools, and classes
within a given school, compare in general ability to learn. The
classroom teacher gets needed information concerning the abilities
of individual pupils. Within a given class, the spread of
ability is often disturbingly wide. A teacher can tell from his
test scores whether John and Mary are doing the caliber of
work which can reasonably be expected of them, and whether he
is pitching his instruction at the comprehension level of the class
as a whole. Parents can plan the future education of their children
more intelligently when they know the level of performance
to be expected of them. And students can set their academic
and occupational goals more realistically when they are aware
of their strengths and weaknesses as shown by comparison of
their scores with norms for their age level.


Counseling, Guidance, and Prognosis 


The total score from a group test (the IQ or other type of score)
is most useful as a measure of a pupil's over-all academic ability.
For guidance and counseling the teacher can use to greater advantage
the sub-tests or part scores from the test battery. The
profile of the California Test of Mental Maturity, for instance,
has been especially designed for diagnosis. From the language
and non-language IQ's, a teacher can judge whether a child is
predominantly “verbal-minded” or “object-minded”; and from
the five “factor” scores on the profile he can judge how proficient
a pupil is in memory, logical reasoning, verbal concepts,
spatial-perceptual relations, and numerical reasoning. In Figure
4-2, for example, low scores in reasoning and vocabulary indicate
poor academic ability; that is, the pupil lacks the ability
to solve problems efficiently by means of symbols (numbers
and words). High scores in these factors reveal good academic
aptitude and, when combined with other traits, suggest that the
pupil is capable of more advanced, and perhaps professional,
training. Low scores in spatial relations reveal little promise of
success in geometry, mechanical drawing, and perhaps manual
training. High scores here, plus high scores in the other factors,
forecast aptitude for engineering and architecture. The memory
factor is based on too meager a sample to provide a reliable
measure of a pupil's functional memory; the score here might
be significant, however, if very high or very low.

Part scores, like those from the CTMM, are helpful in giving
the classroom teacher clues as to a child's abilities. But these
scores must always be interpreted with caution (page 68). The
sub-tests upon which such judgments are based are quite short
and are often too narrow to permit a broad prediction. Marked
differences in part scores should always be substantiated by
further investigation; they should jibe with other tests, with
grades, and with the teacher's judgment from observation of the
pupil's classroom work.

The Otis Quick-Scoring Mental Ability Tests and the Kuhlmann-Anderson
Intelligence Tests are primarily useful as over-all
measures of the general level of mental functioning. The sub-tests
of the Kuhlmann-Anderson are fairly complex. The authors,
very wisely, do not recommend that specific scores (mental
ages) from sub-tests be interpreted as measuring definite psychological
functions. Wide variations in score from sub-test to sub-test
for a given child may be significant, however, of gaps in
training or in native ability.

The Pintner-Cunningham Primary Test is most useful, perhaps,
in helping the teacher and parent decide whether a child
is mature enough mentally to do first-grade work. Entrance into
first grade should not depend solely upon the MA or CA, however.
Children who are babyish in their social behavior and
poorly developed physically are poor prospects for first grade,
no matter how high their IQ's.

Because of its high verbal content, the Terman-McNemar
Test of Mental Ability is one of the best predictors of high-school
achievement. The homogeneity of the sub-tests (their
high degree of relatedness) renders the test less useful for diagnosis
of a student’s strengths and weaknesses. The American
Council on Education Psychological Examination is a good predictor
of college work. This battery measures initiative in attacking
new problems, and mental speed and facility and good work
habits, as well as abstract ability. ACE is also useful in guidance,
since it provides three scores: a quantitative (Q), a linguistic
(L), and a total. The ACE for high-school students is used as
a screening test for prospective college freshmen and as a basis
for counseling high-school students who plan to continue their
education beyond high school. The L score is perhaps most
predictive of general college work, because of the great importance
of reading comprehension in college courses. The Q score
has predictive value for science and mathematics, especially when
confirmed by other indicators (grades, teachers’ judgment).

The ACE (page 91) provides separate norms for 2-year, 
4-year and teachers' colleges. The 4-year colleges have the 
higher average scores, but variation in score from one type 
of college to another is very large, as is also variation in score 
within each college type. A student's chances of entering college 
and staying there will depend to a considerable degree upon the 
college he chooses (see page 115 for discussion of local and 
nation-wide norms) or to which he is admitted. Only superior 
students should be encouraged to apply to high-standard col- 
leges, and not all of these are good risks unless they have the 
personal qualities to go along with academic potential. Good 
personality and a capacity for hard work may not, in themselves,
enhance a student's chances of being accepted into an
A-grade college, but they will help him stay there once he is in. 
Students with relatively low scores on ACE may be quite suc- 
cessful in colleges in which the scholastic standards are not too 
high. In any event, knowledge of his academic strengths and 
weaknesses should be helpful to a student, whether he plans 
further school work or not. 

Norms for the ACE for high-school students are based upon
selected groups and may be much too high for all high-school
seniors. In fact, his PR on the ACE may be unfair (misleadingly
low) for the high-school senior of modest intellectual endowment
who does not plan to enter college. Such a youngster may
rank well up among 18-year-olds in the population but relatively
low among those of college caliber.


Limitations of the Group Intelligence Test 


Intelligence tests have definite limitations, and teachers and
parents must not expect the impossible from them. For one
thing, a group intelligence test cannot increase intelligence, as
parents sometimes seem to think it should. Again, a group test
IQ is not necessarily a good measure of a pupil's drive to accomplish,
or of that dogged determination to stick to an unpleasant
task and see it through. Nor is a fairly high IQ (even a high IQ)
always accompanied by emotional stability, good judgment, and 
initiative. All these traits are related to good intellect, but the 
relationship is by no means perfect. Many persons of average 
intellectual ability succeed in college, whereas many of greater 
potential fall by the wayside. Intelligence is a necessary, but 
is not a sufficient, attribute for high accomplishment in school 
or in life. 


WHAT TO LOOK FOR IN A GROUP 
INTELLIGENCE TEST 


The adequacy of a group intelligence test is judged by its 
validity, reliability, scoring methods, and norms. The object in 
giving the test, its cost, and such factors as time and personnel 
must also be considered. 


Validity. A test is valid, as we have noted, if it measures what it 
purports to measure (page 30). Group intelligence tests have 
been validated, in general, against various criteria judged to be 
indicative of intellect (page 31). Some of these criteria are 
school grades, ratings for ability, and other intelligence tests. 
All such criteria are admittedly indirect and fallible; at the
same time, they represent measures with which any authentic
test of intelligence must correlate. Perhaps the best criterion of
the validity of a group test is its success in predicting performance
in tasks judged to require intelligence: in school, in business,
in the armed forces, or in a profession. Judged by correlational
criteria and predictive power, most of the widely used
group intelligence tests may be accepted as valid, though never
perfectly so.


Reliability. We have already had occasion to use the term
reliability with reference to individual intelligence tests. If a
child earns an IQ of 108 on one form of a group test and three
months later achieves an IQ of 106, or 108, or even 110 on a
second form (that is, scores within a few points of the first
determination), and if most persons examined show similarly
consistent results, we regard the test as reliable. Reliability depends
essentially upon the stability or consistency of a score.
When properly given and scored, most standard group intelligence
tests are highly reliable.
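One common way to express this consistency as a single number is the correlation between scores on two forms of the same test. The sketch below assumes that the reliability coefficient is taken as the Pearson product-moment correlation; the IQ's are hypothetical, and the function name is an illustration only.

```python
from math import sqrt


def pearson_r(x: list, y: list) -> float:
    """Pearson product-moment correlation between paired scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists of at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical IQ's of six pupils on Form A and, three months later, Form B.
form_a = [108, 95, 120, 101, 88, 113]
form_b = [106, 97, 118, 104, 90, 110]

reliability = pearson_r(form_a, form_b)
print(round(reliability, 2))
```

A coefficient near 1.00 means that pupils keep nearly the same relative standing from one form to the other.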


Scoring. Group intelligence tests are first scored in arbitrary 
points, one or more points being assigned to each correct an- 
swer. Point or raw scores are frequently converted into MA's 
and IQ's. Such IQ's are related to, but are not equivalent to, 
Stanford-Binet IQ's. Group test IQ's are adequate for screening 
and often are satisfactory for guidance; but the individual intelli- 
gence test IQ is a more searching and more nearly constant 
measure of a child's talents (page 61). In addition to MA's 
and IQ's, many group tests also provide PR's for raw or obtained 
scores. These PR's are readily interpreted: they show how high 
the pupil ranks on a scale of one hundred points. If a high-school 
senior has a PR of 85 in the (L) part of the ACE and a PR of 
80 on the (Q) part, he should be a good risk for college work. 
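A percentile rank can be computed from a norm group as in the sketch below. It uses one common definition (the percent of the norm group scoring below the pupil, counting half of those who tie with him); this definition, the function name, and the scores are assumptions for illustration, and a published manual may tabulate PR's somewhat differently.

```python
def percentile_rank(score: float, norm_scores: list) -> float:
    """Percent of the norm group below `score`, counting half of any ties."""
    if not norm_scores:
        raise ValueError("norm group must not be empty")
    below = sum(1 for s in norm_scores if s < score)
    ties = sum(1 for s in norm_scores if s == score)
    return 100 * (below + 0.5 * ties) / len(norm_scores)


# Hypothetical raw scores of a ten-pupil norm group.
norm_group = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85]

print(percentile_rank(70, norm_group))  # prints 65.0
```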

A second way of rendering the scores from different sub-tests
in a battery comparable is through the use of standard scores.
Point scores may be converted into a standard score scale with
a convenient mean and σ. The sub-tests of a group intelligence
test usually differ in content, in length, and in difficulty. These
part-scores cannot be compared, or combined, as they stand.
But when converted into a common scale, they can be added to
give a total in which each sub-test has the same weight.
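The conversion to standard scores can be sketched as follows. The scale with a mean of 50 and a σ of 10 is an assumed "convenient" choice, not one named in the text, and the sub-test names and raw scores are hypothetical.

```python
from statistics import mean, pstdev


def standard_scores(raw_scores: list, new_mean: float = 50.0,
                    new_sd: float = 10.0) -> list:
    """Rescale raw scores to a scale with the given mean and sigma."""
    m = mean(raw_scores)
    s = pstdev(raw_scores)  # sigma of the group itself
    if s == 0:
        raise ValueError("all scores are identical; sigma is zero")
    return [new_mean + new_sd * (x - m) / s for x in raw_scores]


# Hypothetical raw scores of three pupils on two sub-tests that differ
# in length and difficulty.
vocabulary = [10, 20, 30]
arithmetic = [4, 5, 6]

vocab_std = standard_scores(vocabulary)
arith_std = standard_scores(arithmetic)

# On the common scale the two sub-tests carry equal weight in the total.
totals = [v + a for v, a in zip(vocab_std, arith_std)]
print([round(t, 1) for t in totals])
```

Adding the raw scores instead would let the longer vocabulary test dominate the total, since its scores spread over a much wider range.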


Norms. Norms (page 40) are typical measures of achievement.
Norms may be nation-wide or local (page 115). Local
norms are often fairer for a given group in that they take into
account the conditions within a given city or state. National
norms are most useful for wide comparisons and as standards at
which to aim. Norms for college freshmen will generally be
much too high for high-school graduates in general. College
freshmen are a selection of high-school graduates according to
academic proficiency. Group intelligence-test norms are usually
given in terms of age level, but they may be in terms of grade
level.


FIGURE 4-4 Norms for Various Occupational Groups on the
Army General Classification Test

Reproduced by permission of Harper & Brothers.

Figure 4-4 shows the norms for certain occupations on the
Army General Classification Test, used in the armed forces in
World War II. The higher scores are achieved by men with
the most extensive training, and are probably the resultant of
both intelligence and training. The more intelligent men are able
to undertake the more exacting training, and this training enables
their native talent to express itself. It is interesting to note the
large degree of overlapping in score from one occupation to
another. It seems evident that many men are functioning at a
level below their native capacity.

The educational expectation of a child whose group test IQ
is 90, 100, or 115 may be read with sufficient accuracy for most
purposes from Table 3-3.

Other Factors Which May Govern the Choice of a Group Intelligence
Test. In addition to the formal requirements to be met by a
group test discussed above, there are other considerations which 
enter into the suitability of a test for a given school system. 
Among the more important are time available for testing, per- 
sonnel, cost, and acceptability. Catalogues provide data on cost 
and time allowances—most testing periods are set to fit com- 
fortably into a class period. In most cases, teachers can administer 
group tests with a minimum of instruction, and scoring can be 
done with stencils. Acceptability of a test depends on whether 
the teachers and the community look with favor upon standard 
tests. Much of the disfavor with which parents once regarded 
mental tests has fortunately disappeared, though one still encoun- 
ters skepticism as to their value. In initiating a testing program 
it is always wise to avoid tests which contain what appear to be 
trick items and those which resemble puzzles. Such tests are 
likely to be labeled frivolous by teachers and parents. Some 
parents still think that the object of a mental test is to describe
their children as dull or mentally abnormal. When they see the
value of a standard test in providing a better understanding of a
child's capabilities, their objections disappear. From the catalogues
listed on page 253, the teacher or administrator should be
able to find the test suitable for a given situation.


SUGGESTIONS FOR FURTHER READING 


Cronbach, L. J. Essentials of Psychological Testing. New York:
Harper, 1949.

Freeman, F. S. Theory and Practice of Psychological Testing (Rev.
edition). New York: Holt, 1955.

Goodenough, F. L. Mental Testing. New York: Rinehart, 1949.

Noll, V. H. Introduction to Educational Measurement. Boston: Houghton
Mifflin, 1957.

Thorndike, R. L. and Hagen, E. Measurement and Evaluation in
Psychology and Education. New York: John Wiley, 1955.


SUGGESTIONS FOR LABORATORY WORK 


1. Administer three or four standard group tests of intelligence to the 
class and have the students score their own papers. If the test is for 
young children, cut the time limits in half. 

2. Select one of the tests taken in (1). Examine the Manual for the 
author's treatment of validity, reliability, scoring methods, and norms. 
Summarize these data. 

3. In another of the tests from (1), count the number of items which, 
in your opinion, are verbal, numerical, and spatial-perceptual. In which 
group did you do best? Worst? Does your result jibe with what you 
know about your abilities?


QUESTIONS FOR DISCUSSION 


1. Is a group test of intelligence anything more than a scholastic 
aptitude test? What else does it add to your knowledge of a pupil? 

2. Why is the score on a reliable intelligence test usually a better 
estimate of a pupil's ability than is the rating of the teacher?

3. Why do we get different IQ's for the same pupil from different
intelligence tests? 

4. Is the group test of intelligence more useful in an academic than 
in a vocational high school? 


5. Suppose you are a sixth-grade teacher. You have administered a
standard group-intelligence test to your class. What uses do you think
you might make of a knowledge of these children’s IQ's?

6. A pupil has taken the CTMM (page 84). In counseling this child,
what help might you get from a wide difference in his language and
non-language IQ’s?


CHAPTER 5 


EDUCATIONAL ACHIEVEMENT TESTS 


The purpose of the educational achievement test, like that of
the ordinary school examination, is to discover how much a
pupil knows about the subjects he has studied or is studying.
Both the general intelligence test and the educational achievement
examination measure aptitude for school work ("abstract
intelligence"). The difference between the two is one of emphasis
rather than of purpose. The intelligence test, as we have
seen, tries to gauge mental alertness apart from specific school
knowledge; that is, it is concerned primarily with the efficiency
of mental processes as exhibited in problems which demand
learning ability, perceptual keenness, memory, reasoning, and the
like. The educational achievement test is also concerned with
mental processes, but only insofar as they are demonstrated in
a student's performance in English composition, arithmetic, history,
or science.

The distinction between the two sorts of test is not always 
clean-cut, and there is much overlap in content and in abilities 
called upon. All intelligence tests depend in some degree on
previous learning, and all educational tests depend in some part on
native keenness. Educational achievement tests predict future
school performance as well as or better than intelligence tests.
Achievement in the elementary school, for example, forecasts
achievement in high school; and performance in arithmetic predicts
later performance in algebra. But prediction is strengthened
when an intelligence test is added to the achievement battery. 
Perhaps the general intelligence test is most useful when we want 
an estimate of potential aptitude, the achievement test when we 
want a measure of present school standing and probable success 
in later school work. Both tests provide valuable information, and 
each supplements the other. 

Educational achievement tests are useful (1) for survey purposes,
that is, to determine a class's standing in relation to some
norm, and (2) for guidance and evaluation, that is, to provide a
clearer understanding of what individual pupils have learned, or
failed to learn, in specific school subjects. A better understanding
of strengths and weaknesses is a major objective of a testing program.
Remedial work can be undertaken more intelligently and
teaching improved when we know what errors a pupil is making
consistently and what misconceptions and gaps in training led to
these errors.

Achievement tests are often used for sectioning pupils in order
to improve working conditions within the classroom. Thus pupils
may be classified into high, average, and low ability groups on the
basis of over-all educational standing, or sectioned within a
grade into fast, medium, and slow learners. Prediction of later
school success on the basis of educational achievement tests is
considerably more accurate than are forecasts based on conventional
school marks.


THE SUPERIORITY OF STANDARD ACHIEVEMENT
TESTS OVER ROUTINE EXAMINATIONS


Standard achievement tests are superior to teacher-made tests
in three principal respects.


1. The Achievement Test Is Better Planned. The usual teacher-made
test in algebra or French is composed of questions and
problems covering topics which one teacher believes worth knowing
about his subject. Usually materials are drawn from a single
textbook. Such a test is valuable as a measure of progress in learning,
but it is not very broad in coverage and does not permit
comparisons with the achievement of students in other schools.

The standard educational achievement examination, on the 
other hand, is compiled after an analysis of many widely used 
textbooks and various courses of study and sets of examinations. 
Thus it represents a consensus—the pooled judgment of many 
competent teachers and testing specialists. Drawing materials 
from many sources insures a representative sampling of subject 
matter. Occasionally a teacher will complain that a general 
achievement test contains questions about topics or books (in 
English literature, for example) which his class has not studied, 
and that on this account the test is unfair. This is often true, but 
the criticism is not as damaging as it may seem. Few classes have 
covered equally well all of the topics treated in a comprehensive 
achievement test. Some teachers will have emphasized one topic, 
some another, but by and large these inequalities will even up for
the test as a whole. Rarely will a school have a general and marked 
advantage (or disadvantage) over another school in educational 
experience, unless the teaching, the curriculum, and/or the caliber 
of the students are exceptionally good or poor. When gross 
inequalities are revealed in test scores, the reason for such differences
should be sought. It seems hardly wise on that account to
abandon the test. 


2. The Achievement Test Is More Objective. The standard
achievement test is more objective than the teacher-made examination.
This means that in an achievement test, grades received
by students depend to a minimum degree on the personal opinions,
likes, and dislikes of the scorer. In the traditional essay
examination, a high degree of subjectivity is almost inevitably
present: the mark given an answer depends on what one teacher
regards as important and significant.


3. The Achievement Test Lays Down More Exact Specifications. 
The educational achievement test is more logically planned than
the ordinary teacher-made examination, because makers of stand- 
ard tests draw up specifications for an examination. These lists 
are often lengthy and quite specific, but in general they can be 
reduced to two—knowledge and application. Thus test items are 
selected to reveal a pupil’s information and understanding of 
facts, as well as his acquired skills in, for example, reading or 
arithmetic. Again, items are chosen to reveal a pupil's ability to
apply known principles, to interpret, draw conclusions from 
given data, and solve problems. The second of these specifications 
is the more important, but the first is not to be dismissed lightly 
as being a matter of “mere memory.” Students cannot write 
good English prose, nor can they read difficult passages in history 
and literature, without adequate vocabulary. Even in so logical 
a subject as mathematics, a student cannot solve "originals" in 
geometry (no matter how bright he is) unless he knows the 
preceding propositions. Rote memory, of course, is rarely
enough. The older spelling bees found how many detached and
isolated words a child can spell, though often he had little idea
of what the words meant. Modern spelling tests try to discover
whether a child can spell a word and also knows its meaning well
enough to use it correctly in a sentence, that is, in context. The
second method (application as well as knowledge) provides a
better measure of a child's usable vocabulary.


GENERAL EDUCATIONAL ACHIEVEMENT
BATTERIES


The present section will describe five representative achievement
test batteries (chosen from many) which are designed to
measure general educational achievement in the elementary
grades and in the high school.

1. The Stanford Achievement Test (SAT)*
2. The Metropolitan Achievement Tests (MAT)
3. The California Achievement Tests (CAT)
4. The Cooperative General Achievement Tests (GAT)
5. The Sequential Tests of Educational Progress (STEP)

All these tests make some provision for the analytic study of a
student’s strong and weak points through a comparison of sub-test
scores. Part scores are often represented comparatively on a
graph or profile.


The Stanford Achievement Test (SAT)** 


Description. The SAT consists of overlapping sub-tests
grouped at four ability levels from grade 2 through grade 9. All
four of the batteries contain three tests of paragraph meaning,
word meaning, and spelling (these are essentially measures of
language skills); and two tests of arithmetic reasoning and
arithmetic computation (number or quantitative skills). All these
tests are multiple-choice in form. In addition to these five sub-tests,
the Intermediate Battery (for grades 5 and 6) and the Advanced
Battery (for grades 7, 8, and 9) include four other tests:
language, social studies, natural science, and study skills. The
Language Test contains items in capitalization, punctuation, and
sentence structure. The Social Studies Test covers fundamentals
of history, geography, and civics. The Study Skills Test is an
ingenious attempt to discover how well a student reads maps,
interprets graphs and tables, and uses references. This information
is important to the teacher, since many pupils regularly skip
all tables and graphs unless supervised.

* These batteries are often referred to in abbreviated form by the capital
letters.
** Published by the World Book Company, Yonkers, N. Y.

There are five forms for each battery. The Primary Battery
is printed in a single booklet of eight pages and takes a little more
than two hours to administer. Figure 5-1 shows some of the items
from the Primary Battery. The Elementary Battery (for grades 3
and 4) contains six sub-tests: paragraph meaning, word meaning,
spelling, arithmetic reasoning, arithmetic computation, and language.
The Intermediate Battery requires almost four hours and
will, of course, have to be spread over several class periods. The
authors of the tests have drawn up a convenient testing schedule,
with approximate times for each sub-test.


FIGURE 5-1 Sample Items from the Stanford Achievement
Test, Primary Battery, Form K


Test I. Paragraph Meaning.
Directions: “Find the one word that belongs in each space, and draw a line
under the word. Do not write in the spaces.”

Baby pets me.
I drink milk.
I say “Mew, mew.”
I am a ____
cow kitten pony child


Test IV. Arithmetic Reasoning.
Directions: “Now look at the pictures. Put your finger on the little chair
in the top box. That is right. Next to the little chair are some candles.
Put a cross on the shortest candle. Make a mark like this.” (Illustrate
on the board, making a large X.)

[Pictures not reproduced.]

“Do you see the row of clocks? Put a big cross on the clock that says it is noon.”


Reproduced by permission of the World Book Company.


Scope. The scope of the SAT is as follows: 


1. Primary Battery: end of grade 1, grade 2, and first half of
grade 3

2. Elementary Battery: grades 3 and 4 

3. Intermediate Battery: grades 5 and 6 

4. Advanced Battery: grades 7, 8 and 9 


These four achievement tests cover the fundamentals taught in 
most schools over the elementary grades through grade 9. 


Scoring and Norms. All of the sub-tests are objective in form,
so that scoring can be readily accomplished by stencils or scor-
ing keys. Norms are given as grade equivalents of raw scores, and
also as percentiles for sub-test scores.

There are two types of norms. The first, called the modal-age 
grade norm, is recommended for individual diagnosis, that is, 
for evaluating the scores of individual pupils. From tables in the 
Manual, a pupil's scores can be compared with those earned by 
children who are typical for age and grade. A second norm, the 
total-group grade norm, is based upon the performance of all
children in a given grade. These norms, given in tables in the
Manual, are recommended by the authors when one wishes to
evaluate a class average. Raw scores on the sub-tests are con-
verted into standard score units so that they may be combined
and compared (page 38).
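
The conversion to standard score units can be illustrated with a short Python sketch. It assumes the common linear form, in which each raw score is expressed as its distance from the group mean in standard-deviation units and then placed on a scale with a chosen mean and spread. The scale values (mean 50, SD 10), the function name, and the raw scores are illustrative assumptions; the Manual's own conversion tables are not reproduced here.

```python
from statistics import mean, pstdev


def to_standard_scores(raw_scores, scale_mean=50.0, scale_sd=10.0):
    """Linear standard scores (an assumed form, not the SAT Manual's tables)."""
    m = mean(raw_scores)
    sd = pstdev(raw_scores)  # standard deviation of the reference group
    return [scale_mean + scale_sd * (x - m) / sd for x in raw_scores]


# Hypothetical raw scores of five pupils on two sub-tests of different length.
paragraph_meaning = [12, 18, 24, 30, 36]
arithmetic_computation = [5, 7, 9, 11, 13]

pm_std = to_standard_scores(paragraph_meaning)
ac_std = to_standard_scores(arithmetic_computation)

for pm, ac in zip(pm_std, ac_std):
    print(f"paragraph meaning {pm:5.1f}   arithmetic computation {ac:5.1f}")
```

Although the two sets of raw scores lie on different scales, each pupil holds the same relative position on both sub-tests, so the converted scores coincide. This is the sense in which standard scores may be combined and compared.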

The validity of the SAT is high. The tests possess content 
validity, and the correlations of the batteries with grades and
other criteria demonstrate excellent predictive validity. The re-
liability of the various batteries is also satisfactory.


The Metropolitan Achievement Tests (MAT)* 


Description. The MAT includes five test batteries with a range
from grade 1 through the first half of grade 9. All of the test
batteries contain sub-tests of reading and arithmetic; spelling is
added after grade 2, and language usage after grade 3. At the
intermediate and advanced levels there are ten sub-tests in all:
reading, vocabulary, arithmetic fundamentals, arithmetic prob-
lems, English, literature, social studies (history), social studies
(geography), science, and spelling. In addition to the complete
batteries, partial test batteries are available for use at the inter-
mediate and advanced levels. These include the skill subjects
(reading, arithmetic, English, and spelling) plus vocabulary and
arithmetic problems. All tests at a given level are printed in a
single booklet.

The MAT provides a comprehensive survey of a pupil's educa-
tional attainment. Moreover, the profile chart (see Figure 5-2)
printed on the last page of the test booklet and the class ability
sheet allow the teacher to identify the student's weak points, to
correct errors consistently made, to study a pupil's rate of
progress from time to time, and to group pupils for instruction
or review. Tests in arithmetic and reading are available as sep-
arates and may be used when it is not feasible to administer the
whole battery.


Scope. MAT includes the following batteries: 

1. Primary Battery I: grade 1 and beginning grade 2 

2. Primary Battery II: grade 2 and beginning grade 3 

3. Elementary Battery: grades 3 and 4 and beginning grade 5 

4. Intermediate Battery: grade 5 up to the first half of grade 7 

5. Advanced Battery: grade 7 up to the first half of grade 9 

MAT covers a wide range of material taught in grades 1-9. Test
batteries require from one hour (primary) to about four hours
(advanced).

* Published by the World Book Company, Yonkers, N. Y.


FIGURE 5-2 Profile Chart for the Metropolitan Achievement
Tests

INDIVIDUAL PROFILE CHART
METROPOLITAN ACHIEVEMENT TESTS: INTERMEDIATE BATTERY, COMPLETE

Scoring and Norms. The MAT is easy to administer and to
score. There are three types of norms: age, grade, and percentile.
Norms are given also in a standard score scale which is based
on the assumption of a normal distribution of test ability in the
sixth grade. Standard or "scaled" scores are comparable from bat-
tery to battery in the same subject, but not from test to test within
a given battery. Figure 5-2 shows the profile of George Fergu-
son, who is 11 years and 8 months old. George is a sixth-grade
student, and the MAT was administered on February 6 when
he was midway through the grade (that is, at 5.5). George's
scores on the ten sub-tests have been converted into age-
equivalents from the appropriate tables in the "Key and Direc-
tions for Scoring." His subject ages (also called educational ages
or EA's) have been entered on the chart and joined by short
straight lines to give the profile of his school achievement. A
straight line drawn horizontally across the chart through George's
chronological age of 11-8 shows immediately in what subjects
he is above or below the scores typical for his age level.
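
The reading of such a profile can be sketched in a few lines of Python. Only George's chronological age of 11-8 comes from the text; the three educational ages, the subject selection, and the function names are hypothetical, chosen merely to show how each subject age is compared with the horizontal line drawn through the chronological age.

```python
def age_to_months(age):
    """Convert an age written as 'years-months' (e.g. '11-8') into months."""
    years, months = age.split("-")
    return int(years) * 12 + int(months)


def profile_deviations(chronological_age, educational_ages):
    """Months by which each educational age lies above (+) or below (-) the CA."""
    ca = age_to_months(chronological_age)
    return {subject: age_to_months(ea) - ca
            for subject, ea in educational_ages.items()}


# Hypothetical educational ages; George's actual EA's are not given in the text.
hypothetical_eas = {
    "reading": "12-6",
    "spelling": "11-8",
    "arithmetic problems": "10-10",
}

deviations = profile_deviations("11-8", hypothetical_eas)
for subject, months in deviations.items():
    position = "above" if months > 0 else "below" if months < 0 else "at"
    print(f"{subject}: {months:+d} months ({position} age level)")
```

A positive figure corresponds to a point plotted above the chronological-age line on the chart, a negative figure to a point below it.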

George’s raw scores were converted into age- instead of 
grade-equivalents. These EA’s show whether George is acceler- 
ated or retarded as compared with children of his own age. EA’s 
are useful in guidance. Grade equivalents give the grade levels to 
which various scores correspond. A profile plotted from grade- 
equivalents tells us whether a pupil is above or below his present 
grade level in his various subjects. Both norms are useful. Grade 
norms are especially useful when comparisons with national or 
local norms are to be made; age norms are most useful when 
diagnosis of a pupil's strengths and weaknesses is wanted. 

Both the validity and the reliability of the MAT are satisfactory
as judged by the usual criteria.


The California Achievement Tests (CAT)*

* Published by the California Test Bureau, Los Angeles, Calif.


Description. The CAT have been organized into five batteries
designed to cover the ability range from grade 1 to college. The
tests are survey in nature and are concerned primarily with skills 
in six areas: reading vocabulary, reading comprehension, arith- 
metic reasoning, arithmetic fundamentals, mechanics of English 
and spelling. The authors of CAT believe that tests in these areas 
are more valuable than are tests in such subjects as social studies, 
where the content varies widely from school to school. The
California Tests emphasize power rather than speed, the time
required for the Elementary Battery being more than two hours.

CAT stresses the use of the separate tests in diagnosis. Except
in the case of spelling, the tests in the six areas are
subdivided into sections, each dealing with some important aspect
of the subject. For example, in the Elementary Battery, reading
comprehension (Test 2) is analyzed into (1) following direc-
tions, (2) reference skills, and (3) interpretation of material.
Test 3, arithmetic reasoning, is broken down into (1) meanings,
(2) signs and symbols, and (3) problems. Scores from each of
these sub-divisions are plotted on a profile like that of Figure 5-2,
usually in grade-equivalent units. The analysis of a pupil's per-
formance is carried still further by a second grouping together
of items which presumably measure essentially common func-
tions. Thus within the division of punctuation under Test 5,
mechanics of English, items are grouped into those which in-
volve commas, periods, question marks, and quotation marks. Under
the heading of addition in Test 4, arithmetic fundamentals, items
are grouped under zeros, carrying, fractions, and decimals. The
number of item classifications under a given test varies from 50
to more than 100. A special chart enables the scorer to analyze
the pupil's achievement over a wide range of these elements.

Careful examination of specific item-groups may, to be sure,
reveal why a pupil fails consistently to use decimals correctly
or to understand fractions; or it may tell us where he is weak in
punctuation, or in vocabulary, or in spelling. The CAT at least
makes an attempt to keep the individual pupil from being lost in
an "average." At the same time, it must be remembered that in-
dividual diagnosis based on a few items is always tentative and
may be misleading.


Scope. The CAT consists of the following batteries, which
cover the educational levels described.


1. Lower Primary: grades 1 and 2
2. Upper Primary: grades 3 and 4
3. Elementary: grades 4, 5, and 6
4. Junior High: grades 7, 8, and 9
5. Advanced: grades 9 to 14



Scoring. Raw scores on the tests may be converted, by tables,
into age, grade, and percentile-within-grade norms. The sub-tests
are objective in form, easy to administer, and easy to score. The
six tests of the batteries have satisfactory reliability, but the re-
liabilities of the various sub-divisions are quite low because of the
few items included in some groupings (often only one or two).
Validity is high for the whole test.
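
Why a grouping of one or two items cannot be very reliable may be seen from the Spearman-Brown formula, a standard result that is not cited in the passage above and is brought in here only as an illustration. The sketch below assumes that the items of a sub-division behave like a shortened version of the whole test; the reliability of .90 and the item counts are hypothetical figures, not values taken from the CAT Manual.

```python
def spearman_brown(reliability, length_ratio):
    """Predicted reliability when test length is multiplied by length_ratio."""
    return (length_ratio * reliability) / (1 + (length_ratio - 1) * reliability)


full_test_reliability = 0.90   # hypothetical reliability of a whole test
items_in_full_test = 100       # hypothetical number of items
items_in_subdivision = 2       # "often only one or two" items per grouping

ratio = items_in_subdivision / items_in_full_test
subdivision_reliability = spearman_brown(full_test_reliability, ratio)
print(f"predicted reliability of the 2-item grouping: {subdivision_reliability:.2f}")
```

Under these assumptions the predicted reliability of the two-item grouping falls to roughly .15, which is consistent with the caution that diagnosis from a few items is tentative.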


Co-operative General Achievement Tests (GAT)*


Description. These achievement tests deal with three fields or
areas: Test I covers social studies, Test II, natural sciences, and
Test III, mathematics. Each test battery consists of two divisions:
Part I, which deals with fundamental terms, concepts and defini-
tions; and Part II, which covers applications of knowledge, in-
terpretation, and comprehension. The battery has been planned
for grades 10, 11, and 12, but it is probably too difficult for all
but superior tenth and eleventh graders. The battery is objective
in form throughout.


Scope. GAT is a power test designed for the upper school
grades and for college freshmen. Each test requires from 40 to
60 minutes.


Scoring. The tests are all multiple-choice, and are easy to
administer and to score. Items are graphic, pictorial, and verbal.
Norms in scaled scores and percentiles are given for high-school
students and college freshmen. GAT is probably most useful in
the counseling of high-school students as to the subject fields in
which they show the greatest promise.


The Sequential Tests of Educational Progress (STEP)* 


Description. As the term "sequential" implies, this battery is
designed to measure a student's progress in learning as he goes
from the elementary grades to college. The tests deal with
critical skills in seven academic areas: essay tests, planned to
provide standardized tests in writing prose; listening compre-
hension tests, in which the examiner reads a passage and asks ques-
tions designed to call out comprehension, interpretation, and
evaluation; reading tests, covering a wide range of content; writ-
ing tests, planned to measure the student's ability to express ideas;
mathematics tests, which contain items over a wide range of
subject matter and difficulty; science tests, dealing with the appli-
cation of scientific knowledge to a variety of situations; and
social studies tests, designed to show progress in social and civic
development.


Scope. STEP is designed to measure achievement over the
following levels:


Level 1: freshman and sophomore years of college
Level 2: grades 10, 11, and 12
Level 3: grades 7, 8, and 9
Level 4: grades 4, 5, and 6


It should be noted that Level 1 is the highest level academically,
Level 4 the lowest. STEP attempts to reveal continuity in mental
growth and learning from the bottom to the top level.


Scoring and Norms. There are two equivalent forms (A and B)
of each test in STEP except the essay tests, for which there are
four forms. There are grade and percentile norms. Scoring is by
stencils. A profile chart allows the examiner to analyze a pupil's
performances on the several functions measured by the battery.


GENERAL ACHIEVEMENT TESTS 
IN THE SCHOOLS 


We have seen how the general educational achievement test
gives the academic level of a pupil or of a class, and how the test
profile reveals strengths and weaknesses in a variety of subjects
and processes. Further illustration of how educational achieve-
ment tests may be utilized in (1) evaluation, (2) diagnosis, and
(3) prediction will be given in this section.


Evaluation. Suppose that Miss Clark has given the SAT to her 
sixth-grade class of twenty-six pupils. She finds her class mean 
(average) on the test battery to be about equal to the local norms 
for the sixth grade, but slightly below the national norm as given 
in the Manual. Does this result mean that Miss Clark is doing a 
poor job because local norms are less valuable than national? The 
answer is No, since a number of factors affect achievement in a 
given school system or a single school, and some of these may 
cause local norms to be lower or higher than national. Among
these factors are the following:


1. Retardation as a consequence of strict promotional stand-
ards and practices. Much retardation will lower local norms,
whereas the weeding out of poor students (by transfer to
special classes, for example) will raise local norms.

2. Promotion by age irrespective of achievement. This fairly
common practice will lead to a progressive lowering of
local grade norms.

3. Previous experience of pupils with standard objective tests.
This factor varies widely and often affects local norms.

4. Coaching in the tests themselves. Sometimes teachers coach
pupils in materials akin to or identical with those found in
the tests. "Teaching for the tests" is bad practice and should
be discouraged whenever possible. Coached pupils usually
raise the class's performance.

5. Selection. Children from a poor socio-economic back-
ground generally score lower on standard tests, whereas
children from good neighborhoods score higher, especially
on the verbal tests.

6. Motivation. Children do not try hard on tests if the teacher's
attitude is negative, or if the parents think achievement
tests are worthless, and say so loudly and often.

7. Transfers, drop-outs. These children may affect local
norms, usually adversely.


In some private schools in which pupils are generally of high 
caliber because of stringent selection procedures, local norms will 
often be found to be considerably above national norms based 
on public school results. In a large city system we can expect an
occasional sixth-grade class to fall below national norms even 
when the city as a whole is up to national standards. But when 
a number of classes fall below national standards, the curriculum, 
the teaching methods, the promotional standards and other con- 
ditions in the school and the community should be examined. 


Diagnosis. In looking over her test results for the sixth grade, 
Miss Clark may find that Harry is far below the sixth-grade norm 
in reading and that Sue is below the norm in arithmetic. At the 
same time, Mary reads at eighth-grade level, and John (the 
youngest child in the class) is up to the ninth-grade norm in 
science. Individual differences like these are the rule rather than 
the exception in most elementary classes. It is fairly easy for Miss 
Clark to prescribe further reading for Mary, and to stimulate John 
to carry out an individual project in science (for example, classi-
fying the birds in the local community). The below-average chil-
dren often present real problems, and as a result they are given
more of the teacher's time and effort than the bright children.
If the extra time which Miss Clark can devote to Harry and Sue
is insufficient to bring these children up to the sixth-grade levels
in reading and arithmetic, they should be referred to special
classes, if such are available. The larger the number of below-
average children, the more difficult is Miss Clark's task, and the
more likely she is to neglect the bright children.

It should be noted, as a further point, that the printed norm 
(local or national) does not necessarily establish the optimum 
level of performance for every pupil in the sixth (or any other)
grade. If Norman, whose IQ is 120, is just on the sixth-grade
norm in reading and arithmetic, he is not performing up to ex-
pectation; his scores should be above the norm for his grade. On
the other hand, if Bill, whose IQ is a modest 94, is at or above
the norm for the sixth grade in reading and arithmetic, he is
actually doing better than we can reasonably expect of him. The
intelligence of the child must always be considered in deciding
whether his school work is "normal" for the grade.
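
One rough way to put a number on "expectation" is sketched below in Python. It rests on several assumptions that are not stated in the passage: that the IQ is the classical ratio IQ (IQ = 100 x MA / CA), that both boys are 11 years 6 months old, and that a pupil who is "just on the grade norm" has an educational age equal to his chronological age. The IQ's of 120 and 94 come from the text; everything else is hypothetical.

```python
def mental_age_months(iq, chronological_age_months):
    """Mental age implied by a ratio IQ: MA = IQ x CA / 100."""
    return iq * chronological_age_months / 100


CA_MONTHS = 138            # hypothetical: both boys are 11 years 6 months old
EA_ON_NORM = CA_MONTHS     # assumption: "on the grade norm" means EA equals CA

pupils = {"Norman": 120, "Bill": 94}   # IQ's as given in the text

# Gap between achievement on the norm and the level suggested by mental age.
gaps = {name: EA_ON_NORM - mental_age_months(iq, CA_MONTHS)
        for name, iq in pupils.items()}

for name, gap in gaps.items():
    verdict = "below expectation" if gap < 0 else "above expectation"
    print(f"{name}: {gap:+.1f} months ({verdict})")
```

On these assumptions Norman, though exactly on the norm, stands more than two years below the level his mental age suggests, while Bill stands somewhat above his.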

Sometimes Miss Clark will suspect from a pupil’s sullen be- 
havior, or open aggressiveness, or his tendency to whimper at 
the slightest provocation that emotional factors are causing or 
contributing to his difficulties in school. Such a pupil should be 
referred to the school psychologist (if there is one) or to the 
school physician. The clinical psychologist is often able through 
tests and interviews to get a clearer idea of a pupil's difficulties 
than can the teacher. The teacher should visit a child’s home if 
she suspects that parents and home environment are involved, as 
they often are. Corrective measures (when possible) can be more 
intelligently applied when causal factors making for undesirable 
conduct and/or poor school work are known, rather than sur- 
mised from superficial impressions. 

Prediction. Whether it will be profitable for a student to take 
science or mathematics in high school or college can be forecast 
with considerable assurance from his performance on standard 
tests. Prediction of later success is usually improved when tests
given in elementary schools are combined with a good intelli-
gence test. Intelligence and achievement tests are regularly
utilized in many schools in the selection and placement of stu-
dents in courses of study. The combination of achievement tests
and special aptitude tests is valuable in predicting a student's
success in a professional school (in law or medicine, for example).


ACHIEVEMENT TESTS IN SPECIAL 
SUBJECT AREAS 


In the preceding section, we described five general achieve- 
ment batteries designed to assess academic standing in school. In 
the present section, we shall consider several representative sub- 
ject-matter achievement tests. These include tests of reading and 
arithmetic, as well as tests planned to determine mental maturity
(readiness) and proficiency in special subjects. Of the various
subject-matter tests, those in reading and arithmetic are most
often given, since they represent fundamental skills upon which
school achievement largely depends. Subject-matter tests are
found, of course, in the general achievement batteries, as well as
in separate form. The tests listed below were selected as being
typical of a very large number available.


1. Metropolitan Readiness Tests
2. Iowa Silent Reading Tests
3. Co-operative Mathematical Tests
4. Evaluation and Adjustment Series
5. Co-operative French Test (elementary)
6. Co-operative Science Test


Metropolitan Readiness Tests* 


* Published by the World Book Company, Yonkers, N. Y.


Description. The primary objective of these tests is to find
whether a child is sufficiently mature to undertake the study of
reading. But the tests are concerned also with "readiness" for
arithmetic, and with general physical and mental maturity. The
six tests in the battery may be described as follows:


(1) Word Meaning: child selects picture named by the ex-
aminer.

(2) Sentences: same as (1) except that the examiner uses
sentences and phrases instead of single words.

(3) Information: child marks the picture corresponding to the
examiner's oral description.

(4) Matching: child must recognize similarities and differences
in pictures, geometrical forms, numbers, letters, words.

(5) Numbers: child must demonstrate a knowledge of number
concepts and carry out simple operations.

(6) Copying: child is required to copy simple graphic forms,
as well as numbers and letters.


All the test items are pictorial, that is, non-verbal. The test
has two forms. The battery is essentially a prognostic test: its
purpose is to forecast a child's mental, sensory-motor, and mus-
cular readiness for first-grade work. Figure 5-3 shows sample
items.


Scope. The test is for the end of kindergarten and the begin-
ning of first grade. The test requires about sixty minutes working
time.


Scoring. Norms in percentile ranks allow the teacher to estimate 
a pupil’s readiness for reading (based on tests 1-4), readiness for 
arithmetic (test 5), and general maturity for first-grade work 
(tests 1-6). In addition, a child’s score is given a rating from A to 
E. An A rating denotes an excellent risk, the other letters a lesser 
degree of certainty down to E, which implies almost certain
failure. 
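
The grouping of the six tests into the three readiness estimates can be laid out as a small Python sketch. The assignment of tests to each estimate follows the paragraph above; the assumption that the estimates are formed by simply adding raw scores, together with the raw scores themselves and the names used, is hypothetical. The percentile tables and the cut-off points for the A to E ratings are in the test's own materials and are not reproduced here.

```python
# Which tests feed each readiness estimate, as described in the text.
COMPOSITES = {
    "reading readiness": (1, 2, 3, 4),
    "number readiness": (5,),
    "total readiness": (1, 2, 3, 4, 5, 6),
}


def composite_scores(raw_by_test):
    """Add raw scores over the tests in each composite (assumed scoring rule)."""
    return {name: sum(raw_by_test[test] for test in tests)
            for name, tests in COMPOSITES.items()}


# Hypothetical raw scores for one child, keyed by test number.
raw_scores = {1: 15, 2: 10, 3: 11, 4: 14, 5: 16, 6: 7}

scores = composite_scores(raw_scores)
for name, value in scores.items():
    print(f"{name}: {value}")
```

Each composite would then be looked up in the percentile table for the child's group before any rating is assigned.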

Prognostic Value of the Metropolitan Readiness Tests. The test
battery as a whole forecasts general maturity for the first grade,
but its sub-tests may be used diagnostically to provide informa-
tion about individual children. If Ben makes low scores on tests 1,
2, 3, and perhaps 4, for example, he has inadequate maturity in
language for first-grade work. Or he has too little experience
with and comprehension of language generally. If Louise earns
low scores on tests 4 and 6, she is probably too immature to under-
take written work. As these two tests measure visual perception
and hand-eye co-ordination, an eye examination and training in
motor skills may be indicated. Test 5 (numbers) shows readiness
for number work, and the child who scores high should be able
to use numerical symbols. Test 6 (copying) has proved to be a
good measure of physical and mental maturity. From this test,
the teacher can pick up tendencies to reversals in drawing and
writing, phenomena fairly common at this age level. If a child
has not developed reading readiness by age 7½, he should be
examined by a physician, an oculist, and perhaps a psychologist.


FIGURE 5-3 Sample Items from Metropolitan Readiness Tests


Test 1. Word Meaning. In the first row, the child marks the baby;
in the second row, the house.

Test 4. Matching. In each row the child circles the picture identical
to the one in the circular frame.

Reproduced by permission of the World Book Company.


Iowa Silent Reading Tests*

Description. This test consists of two batteries, one for elemen-
tary schools and one for high schools and colleges. Both batteries
measure reading rate, vocabulary, sentence comprehension,
paragraph reading, and skill in locating information. Speed is an
element in the battery, as well as power. The Elementary Test
includes a reading comprehension test called "directed reading,"
and the Advanced Test a test of poetry comprehension.


Scope. The two batteries cover the following range:


Elementary Test (four forms): grades 4-8
Advanced Test (four forms): high school and college freshmen


Working time for either battery is about 50 minutes.


Scoring. There are six sub-tests in the Elementary Test:


1. Rate and comprehension in reading connected prose.
2. Directed reading of prose to get answers.
3. Vocabulary and word meaning.
4. Paragraph reading: selecting the main idea and adding
appropriate details.
5. Sentence meaning: understanding brief sentences out of
context.
6. Work-study skills: alphabetizing and using an index.


Test 1 yields two scores (rate and comprehension) and Test 6
two scores (alphabetizing and use of index). Tests 2, 3, 4, and 5
yield one score each. These 8 sub-scores are converted into scaled
scores by means of tables appended to each test. Scaled scores
may be plotted on a profile to show the variations in perform-
ance. Percentile norms are also provided by grade for each sub-
test and for total score. There are age and grade equivalents to
total score.

The Iowa Test can be expe . . . reader, (b) the careless reader
who fails to follow directions, omits necessary details, and skims
over important facts, and (c) the rapid but uncomprehending reader.

* Published by the World Book Company


Co-operative Mathematics Test for Grades 7, 8, and 9* 


Description. This test consists of four parts: I, skills; II, facts,
terms, and concepts; III, applications; IV, appreciation. Ques-
tions and problems cover basic arithmetic as well as simple algebra
and geometry. The test may be used for survey purposes, but it
is perhaps more valuable in evaluation and guidance. Sample items
from the test are shown in Figure 5-4.


Evaluation and Adjustment Series (High School)** 


Description. This is an extensive battery of subject-matter and 
other tests (twenty-four so far and more to be added) designed 
for use in high schools. The tests cover such traditional areas
as algebra, biology, geometry, physics, history, and literature. In 
addition, there are tests of reading comprehension, “problems in 
democracy,” health knowledge, and study skills. The content of 
the tests has been drawn from standard textbooks, courses of 
study, and professional literature. Tests may be administered as 
separates or as parts of a general survey. 


Scope. For survey and diagnosis in grades 9 through 12. There 
are two forms for most tests. 


Scoring. Raw scores are converted into scaled scores for each
test, so that comparisons may be made from test to test. Results
may also be compared graphically by means of a profile. Many
of the tests provide charts showing what score is to be expected
at given IQ levels. IQ's are from the Terman-McNemar Test of
Mental Ability. The reliability of the various tests in the battery
is satisfactory. The separate tests require from 45 minutes to an
hour of working time.


* Published by the Educational Testing Service, Princeton, N. J.


FIGURE 5-4 Sample Multiple-Choice Items from the Cooperative
Mathematics Test for Grades 7, 8, and 9


From Part I, Skills:

39. 416 equals
[answer options not legible]


From Part II, Facts, Terms, and Concepts:

7. Which of the following is a unit in the
metric system?
7-1 Ounce
7-2 Centimeter
7-3 Yard
7-4 Bushel
7-5 [not legible]


From Part III, Applications:

24. If a man spends 12% of his salary on bonds,
and buys a $37.50 bond each month, what
is his monthly salary?
24-1 $312.50
24-2 $312.60
24-3 $350
24-4 $376.20
24-5 $450


From Part IV, Appreciation:

20. Which of the following has no volume?
20-1 Cylinder
20-2 Cone
20-3 Square
20-4 Cup
20-5 Rectangular box


Reproduced by permission of the Educational Testing Service.


Co-operative French Test (Elementary)* 


Description. The specifications for this test call for knowledge
of French grammar and vocabulary, plus the ability to use the
language in reading and translation. The test has three parts:
vocabulary, grammar, and reading. The vocabulary section is a
multiple-choice test of fifty words. Grammar (thirty-five items)
requires the selection of one of five choices to complete correctly
the translation of an English sentence into French. In the reading
section, forty incomplete sentences in French are to be completed
from a list of five options. The reliability of the test is high.

* Published by the Educational Testing Service, Princeton, N. J.


Scope. This test is intended for the first two years of high
school or for the first year of college study of French.


Scoring. Scaled scores are provided for each of the three parts
of the test and for the total. There are percentile norms for high-
school and for college classes. Working time for the test is forty
minutes.


FIGURE 5-5 Sample Multiple-Choice Items from the Coopera-
tive Science Test for Grades 7, 8, and 9


From Part I, Informational Background:

3. It is believed that dinosaurs lost out in
their struggle for existence chiefly because
3-1 they were killed by man for food.
3-2 man could not tame them.
3-3 they were not adapted to changes
that took place in the earth's sur-
face and climate.
3-4 they were not fitted to eat plant food.
3-5 they had no brains.


From Part II, Terms and Concepts:

2. The instrument used to look at and study
the surface of the moon and the planets is
the
2-1 galvanoscope.
2-2 microscope.
2-3 telescope.
2-4 electroscope.
2-5 radiometer.

13. If two plants of the same species but of
different varieties are mated, the offspring
are called
13-1 mongrels.
13-2 sports.
13-3 biennials.
13-4 lentils.
13-5 hybrids.


Part III, Comprehension and Interpretation:

This test consists of multiple-choice items to be answered after reading
a paragraph of scientific prose or examining a table. The selection must
be understood and interpreted.


Reproduced by permission of the Educational Testing Service.


Co-operative Science Test (Grades 7, 8, and 9)*


Description. There are three parts to this test: Part I, informa-
tion and background; Part II, terms and concepts; Part III, com-
prehension and interpretation. The test is planned to measure
knowledge and application. Part II is in multiple-choice form.
Part III consists of readings in science, each reading followed by
questions designed to assess the student's understanding, as well
as his ability to interpret and apply what he has read. (Figure 5-5)

Scope. Grade 9 and superior seventh and eighth graders. 

Scoring. There are scaled scores for the three parts and for the 
total. Percentile norms are given for grades 7, 8, and 9. The 
working time for the whole test is about eighty minutes. Re- 
liability of the whole test is high. 


WHAT TO LOOK FOR IN AN EDUCATIONAL 
ACHIEVEMENT TEST


The suitability of an educational achievement test for a given
situation must be determined from an examination of its validity,
its reliability, its scaling techniques, and its norms. The cost,
time, and personnel needed to administer and score the tests must
also be considered. These same requirements apply to group tests
of intelligence. Each of the main characteristics of a mental test,
except perhaps validity, has been commented on at various
places throughout this chapter. A summary of the relevant data
under each category will now be offered.


Validity. An educational achievement test is valid when it
measures what it undertakes to measure. Most subject-matter
tests possess content validity. An arithmetic test or a geography
test or a reading test, for example, is valid by definition when it
contains a sampling of arithmetic problems, geography questions,
and paragraphs to be read. The standardized educational test is
made up of items taken from a variety of sources: widely used
textbooks, courses of study, examination questions, and outlines.
The items in tentative form are checked by experienced teachers
and are put into objective form by test construction specialists. A
broad selection of items insures a comprehensive sampling of
materials.

One validation technique employed in some educational tests
is the following. The test is provisionally drawn up and is
administered to an experimental group; only those items are retained
which show an increasing percentage passing with age or with
grade. Other techniques of item analysis will be described in
Chapter 9. All of these procedures are directed toward selecting
questions which will work together as a team, cover a wide range
of difficulty, and be related closely in content (be homogeneous).
The standard test, when finally made, is a compact and closely
knit instrument for measuring what it purports to measure. Data
on validation procedures will be found in most Manuals which
accompany standardized achievement tests.
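
The retention rule described above can be illustrated with a short sketch. It is not the procedure of any particular test publisher: the function names, the item labels, and the percentages are invented for the example, and "increasing" is taken here to mean a strict rise from each grade to the next.

```python
def is_increasing(percents):
    """Return True if each percentage passing exceeds the one before it."""
    return all(later > earlier for earlier, later in zip(percents, percents[1:]))


def retain_items(percent_passing_by_grade):
    """Keep only items whose percentage passing rises with grade.

    percent_passing_by_grade maps an item label to a list of percentages,
    ordered from the lowest grade tested to the highest (hypothetical data).
    """
    return [item for item, percents in percent_passing_by_grade.items()
            if is_increasing(percents)]


# Hypothetical try-out results for grades 4, 5, and 6.
tryout = {
    "item_1": [22, 41, 63],   # rises steadily: retained
    "item_2": [55, 52, 58],   # dips in grade 5: dropped
    "item_3": [10, 10, 35],   # no gain from grade 4 to 5: dropped
}

retained = retain_items(tryout)
print(retained)
```

In practice a test maker would probably allow for sampling fluctuation rather than insist on a strict rise; the strict rule is used here only to keep the sketch short.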


Reliability. The reliability of the educational achievement tests
described in this chapter has been generally reported as high.
This means that parallel forms of the test correlate highly (over
.90 in most cases) so that we may have confidence in the stability
of a child's score. In most test Manuals, reliability is expressed by
the “reliability coefficient,” also called the self-correlation of the
test, or by the standard error of an obtained score. The
correlation of a test with itself (by retest) or between alternate forms of
the same test tells us how closely the pupils’ scores “stay put.”
The standard error of a score tells us how much fluctuation to
expect in a child's score upon retest. If the standard error is three
points, for example, the odds are two to one that Bill's score of 64
will, on a second trial on the test, vary up or down from the first
determination by not more than three points. The smaller the SE
of a test score, the greater the stability of the obtained score. The
SE of a test score gives us more information concerning reliability
than does the reliability coefficient alone (page 29).
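
The arithmetic behind the example of Bill's score can be sketched as follows. The sketch assumes the usual formula for the standard error of an obtained score, SE = SD × sqrt(1 − r), where r is the reliability coefficient; the standard deviation of 10 points and the reliability of .91 are invented so that the SE comes out at three points, and the function names are hypothetical. The "two to one" odds correspond to the fact that roughly two-thirds of a normal distribution lies within one SE of the center.

```python
import math


def standard_error_of_score(sd, reliability):
    """Standard error of an obtained score: SD * sqrt(1 - r)."""
    return sd * math.sqrt(1.0 - reliability)


def one_se_band(score, se):
    """Range extending one SE on either side of the obtained score."""
    return (score - se, score + se)


# Hypothetical test: SD of 10 points, reliability coefficient of .91.
se = standard_error_of_score(sd=10.0, reliability=0.91)
low, high = one_se_band(64, se)

print(round(se, 2))                   # about 3 points
print(round(low, 1), round(high, 1))  # band around Bill's score of 64
```

Note that the same SD with a reliability of .75 would give an SE of five points, which shows why the SE is the more directly useful figure when a single child's score is to be interpreted.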

Scores obtained on most standard tests are highly stable, but
part scores based on relatively few items are variable and may
be quite unreliable. Conclusions as to strengths and weaknesses
based on unstable scores are always tentative, and must be
regarded as suggestive only.


Scaling. Most educational achievement tests are first scored in
arbitrarily assigned points, so many points being given for a
correct answer. These point scores are usually converted into
scaled scores by means of tables printed at the end of the
sub-test. The meaning of standard scores and of T-scores has been
discussed in Chapter 2. Raw or obtained scores (point scores)
come from sub-tests which differ in length, difficulty, and
content; they cannot be compared or combined as they stand.
When scaled, scores expressed in different units are comparable.
Scaled scores, and sometimes raw scores, are usually converted
into age and/or grade equivalents, that is, into the age and grade
values which correspond on an average to the given scores. If the
average child of 9 years and 4 months earns a score of 38 on an
Arithmetic Fundamentals Test, then the score of 38 “equals” an
educational age (EA) of 9-4. If children who are half way
through the seventh grade (that is, at 7.5) earn a mean score of
63 on a Reading Test, the score of 63 has a grade equivalent of
7.5.
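
Why raw scores from different sub-tests cannot be compared as they stand, and how scaling removes the difficulty, may be seen from a small sketch. It assumes a simple linear T-score (mean 50, SD 10) computed within the group tested; published tests supply their own conversion tables, which need not follow this formula. The raw scores and the function name are invented for the illustration.

```python
from statistics import mean, pstdev


def t_scores(raw_scores):
    """Convert raw scores to linear T-scores (mean 50, SD 10) within one group."""
    m = mean(raw_scores)
    sd = pstdev(raw_scores)
    return [50 + 10 * (x - m) / sd for x in raw_scores]


# Hypothetical raw scores for the same five pupils on two sub-tests.
arithmetic = [12, 15, 18, 21, 24]     # short sub-test
reading = [40, 55, 70, 85, 100]       # longer sub-test

arith_t = t_scores(arithmetic)
read_t = t_scores(reading)

# The first pupil stands equally far below the mean on both sub-tests,
# a fact which the raw scores 12 and 40 do not show.
print(round(arith_t[0], 1), round(read_t[0], 1))
```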
The educational age (EA) may be divided by the chronological
age (CA) to give an educational quotient (EQ): EQ = EA/CA.
This EQ is a measure of acceleration and is somewhat analogous
to the IQ. The EA and EQ are often useful, provided they are
taken to refer only to the tests on which they are based and are
not thought of as general indices.
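
A brief sketch of the EQ computation follows. It assumes that ages are written in the "years-months" form used above (9-4) and that the ratio is multiplied by 100, by analogy with the IQ; the multiplier, the pupil's ages, and the function names are assumptions made for the illustration.

```python
def age_in_months(age):
    """Convert an age written as 'years-months' (e.g. '9-4') to months."""
    years, months = age.split("-")
    return int(years) * 12 + int(months)


def educational_quotient(ea, ca):
    """EQ = 100 * EA / CA, with both ages given as 'years-months'."""
    return 100.0 * age_in_months(ea) / age_in_months(ca)


# Hypothetical pupil: educational age 9-4, chronological age 8-9.
eq = educational_quotient("9-4", "8-9")
print(round(eq))
```

An EQ above 100 on this convention indicates that the pupil's standing on the particular test is ahead of that typical for his age; it says nothing about tests he has not taken.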


Norms. Norms are typical measures of performance. In a
standard educational test, the mean score made by a large and
representative group of fifth-grade pupils is the norm for
fifth-grade children on this test. Norms are expressed in age and grade
equivalents, as percentile ranks, and in the form of scaled scores.
A child’s grade placement is found by computing the tenths of
the school year which have passed before the test was given. If
the school year begins about September 1 and ends June 15, a
sixth-grade class tested in the period between March 16 and April
15 is assigned the grade position of 6.7; the class is 7/10 into the
school year. Most standard educational achievement tests report
nation-wide norms in their Manuals. These typical performances
are based on the achievements of large groups of children from
all over the country. As we have pointed out, local norms (for
city or state or both) are often better measures of pupil
achievement. Any pupil's scores relative to those of other pupils should
be evaluated in terms of his effort, his intelligence, and his home
and community.
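
The computation of grade placement can be sketched as follows. The text gives only one point on the scale (March 16 to April 15 counts as .7 in a school year running from about September 1 to June 15); the sketch assumes that one tenth is credited for each monthly period running from the 16th of one month to the 15th of the next, which reproduces that point. The remaining values, and the function name, are extrapolations for illustration and should be checked against the Manual of the test actually used.

```python
from datetime import date

# Assumed convention: school opens about September 1, and one tenth is
# credited for each period from the 16th of a month to the 15th of the next.
SCHOOL_MONTH_ORDER = [9, 10, 11, 12, 1, 2, 3, 4, 5, 6]


def grade_placement(grade, test_date):
    """Return a grade placement such as 6.7 for a date within the school year."""
    if test_date.month not in SCHOOL_MONTH_ORDER:
        raise ValueError("test date falls outside the assumed school year")
    tenths = SCHOOL_MONTH_ORDER.index(test_date.month)
    if test_date.day >= 16:
        tenths += 1
    if tenths > 9:
        raise ValueError("test date falls after the assumed close of school")
    return round(grade + tenths / 10, 1)


# A sixth-grade class tested on March 20 (the year is arbitrary).
print(grade_placement(6, date(1955, 3, 20)))
```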


Other Factors in the Selection of a Test. The cost of a testing
program, the personnel required, and the time it will take from other
school activities: all these must be considered in adopting a given
test or tests. Tests which fit easily into a class period, which can
be scored objectively (by means of stencils) by a clerk, and
which are acceptable in form and content to teachers and to
parents are in general least disruptive of the school’s routine.


SUGGESTIONS FOR FURTHER READING

Greene, H. A., Jorgensen, A. N., and Gerberich, J. R. Measurement
Longmans, Green, 1953.

Jordan, A. M. Measurement in Education. New York: McGraw-Hill,
1953.

Traxler, A. E. et al. Introduction to Testing and the Use of Test Results
in Public Schools. New York: Harper, 1953.


SUGGESTIONS FOR LABORATORY WORK 


1. Administer two or three standardized achievement tests to the
class, cutting the time to one-half if necessary. Have students score
their own tests and plot profiles where called for.

2. Analyze a standard reading test, listing the objectives which you
think the author had in mind. Do you agree that these objectives were
fulfilled?

3. Select a test taken in (1). Consult the Manual for data on validity,
reliability, scaling procedures, and norms.


QUESTIONS FOR DISCUSSION 


1. For which of the following purposes would a standardized achieve- 
ment test be useful: 

(1) To discover which pupils have not mastered multiplication and 
division of fractions. 

(2) To determine which pupils are reading too slowly. 

(3) To determine for the class which punctuation skills need further 
work. 

(4) To section the class into two groups for teaching arithmetic.

(5) To discover the subjects in which each pupil is strong and in
which weak.

2. A teacher lists the following as objectives of a course in history and
civics:

(1) To present facts in the field.

(2) To prepare the class for the duties of citizenship.

(3) To further appreciation of democracy.

(4) To foster criticism of governmental processes.

(5) To aid pupils in thinking about problems in government.
Which of these objectives is the teacher most likely to fulfill?

3. The Manual of Test ABC states that the test may be used for
diagnostic purposes. What do you look for in a test to determine whether
it has diagnostic value?

4. A teacher of English states that batteries of standard tests tell the
English teacher nothing that cannot be better found out from a theme
and an interview. Do you agree?

5. Why is it necessary that sub-tests in a battery to be used for
diagnosis have high reliability?

6. In some schools, teachers prepare for a testing program by having
students review older standard examinations. What effect could this
have on the students’ morale? On the comparability of test results from
school to school? Is it good educational practice?

7. In School A, the pupils in grades 4 to 7 are given the California 
Achievement Tests. Scores are recorded in grade equivalents only. What 
other types of scores would be valuable? Why?

8. The Manual of a reading test reports a correlation of .40 with
English marks in the first year of high school. Is this good evidence of
validity? Discuss.

9. Suppose that the Metropolitan Achievement Tests have been
administered in grade 5 in October. How might you, as the teacher, use the
results of the test?

10. For what predictive purposes would it be desirable to have the
results from the following tests:

(1) A test of ability to read difficult scientific prose drawn from
various fields.

(2) A test of skill in grammar: punctuation, capitalization, sentence
structure, and so on.

11. How could you use the results from a group intelligence test to
supplement scores made by your pupils on an achievement battery?

12. Is it important to have tests of speed, as well as of power?


CHAPTER 6


APTITUDE TESTS


When a youngster possesses traits and abilities which enable 
him to speak French readily, acquire mathematics, deal handily 
with tools, or play a musical instrument well, he is said to have 
aptitude for the given activity. Aptitudes are probably inherited 
basically, but they cannot appear unless the environment is 
favorable—that is, unless the opportunity is provided. Very often 
some training, often a great deal of it, is necessary, too, before an 
aptitude reveals itself in performance.

Aptitude tests are not essentially different in form or in content
from intelligence and educational achievement tests, since
all mental tests are in reality measures of aptitude. Intelligence
tests measure capacity for school work and for vocations requiring
school training; and achievement tests measure proficiency
in English grammar, mathematics, science, and other subjects.
Perhaps the chief difference between these tests and those
designed to measure aptitudes is the fact that an aptitude test is
concerned almost entirely with the future, that is, with prognosis.
Thus an engineering aptitude test is used typically to forecast an
examinee's chances of success in engineering. The aptitude test
alone is, of course, rarely able to provide a wholly satisfactory
estimate of probable performance later on. For an individual's
efforts to be maximally effective, aptitude must be supplemented
by training. Furthermore, the examinee must possess
initiative, interest in the job, and favorable personality
characteristics.

We have classified aptitude tests under four heads: (1) general,
(2) special, (3) professional, and (4) talent. The two best-known
general aptitude batteries are those designed to assess aptitude
(a) for mechanical tasks and (b) for clerical work. Many special
tests (of speed, co-ordination and reaction time) have been
devised to measure aptitudes believed to be crucial in industry.
Achievement tests, too, are employed as aptitude tests to reveal
an examinee’s performance in languages or mathematics, for
example, and hence provide a measure of his promise in more
advanced courses. In the field of professional work, aptitude test
batteries have been assembled to assess the traits believed
necessary for success in medical school, in law, in engineering,
and in teaching. Aptitude in music and art is generally called
talent, and tests are available to forecast achievement in these
fields.


GENERAL APTITUDE BATTERIES


The general aptitude battery attempts to forecast probable
success in a number of related tasks or vocations by sampling a
wide range of behaviors believed to be involved in the activity.
In this section, two batteries designed to measure aptitude for
mechanical work are described, together with two batteries
planned to measure aptitude for clerical proficiency.


Mechanical Aptitude 


The term "mechanical aptitude" includes a variety of behaviors.
One of the earliest mechanical aptitude tests consisted of
a box containing a number of common gadgets in separate
compartments. Each of these contrivances (a lock, door bell, clothes
pin, and so on) was to be assembled with the aid of simple tools.
The score was determined by the speed and accuracy of assembly.
This kind of test is often described as a “job sample” or
“vocational miniature,” since it involves what has to be done on
a small scale. Among the sub-tests in paper-and-pencil batteries
devised to measure mechanical aptitude are (1) tests requiring
motor speed and dexterity of movement; (2) tests of the ability
to visualize or perceive mechanical and spatial relations
(important in reading blueprints and in architectural drawings); (3)
tests of mechanical information concerning tools, machines, and
the construction and use of various contrivances; and (4) tests
of mechanical reasoning as demonstrated in the ability to solve
problems dealing with tools, pulleys, levers, machine parts, and
the like. In addition, in assessing mechanical aptitude, inventories
are used which are designed to reveal interest in mechanical
things. Such interest may be shown, for example, when a boy
reads Popular Science avidly, has his own tools, tinkers with
radios, and builds space machines. One of the most useful
findings to come out of the testing program in World War II was
the discovery that paper-and-pencil tests of mechanical aptitude
are as predictive of success in many mechanical jobs as are actual
job samples covering the work.

The following two test batteries are representative of the best
tests in this field:

MacQuarrie Test of Mechanical Ability
Bennett Mechanical Comprehension Test


MacQuarrie Test of Mechanical Ability*


Description. This battery consists of seven paper-and-pencil
tests, as follows:

Tracing: following a narrow path.
Tapping: making dots rapidly.
Dotting: placing dots precisely.
Copying: making a figure from co-ordinates.
Location: locating items by co-ordinates.
Block Counting: counting hidden blocks in a stack.
Pursuit: tracing a line through a tangled pattern.

Sample items from the MacQuarrie tests are shown in Figure 6-1.
All these tests are relatively simple and all are speeded: testing
times are short. The MacQuarrie tests are designed to measure
hand-eye co-ordination, finger movement and speed, manual
dexterity, visual acuity, and spatial perception of direction and size.
Taken as a whole, the MacQuarrie battery measures motor
dexterity at a fairly low level of difficulty rather than aptitude for
engineering or for architecture. For the latter, the Bennett Test
of mechanical comprehension is recommended. Some of the
MacQuarrie sub-tests are predictive of special tasks: the tests in
tracing, dotting and pursuit, for example, measure aptitude for
typing; and tests of block counting, tracing, pursuit, location, and
copying are related to performance in mechanical drawing and
the reading of blueprints. The Manual which accompanies the
MacQuarrie advises the use of sub-test patterns for predicting
success in various jobs.


Scope. The MacQuarrie test can be administered from grade 7
on. It has been employed chiefly in the prediction of success in
factory and other manual-manipulative work.


Scoring. Percentile norms are available for the sub-tests and for
total score. The working time for the whole test is about twenty
minutes. Since some of the tests in the battery are allotted only
ten to twenty seconds, a stop watch is needed in order to time
the tests accurately. The reliability of the whole test is high.
Reliabilities of the seven sub-tests are lower, but are fairly
satisfactory for such short tests.


FIGURE 6-1 Sample Items from the MacQuarrie Test of
Mechanical Ability

Dotting: Place a dot in each circle as rapidly as possible.
Blocks: How many blocks touch each block with an X on it?
Copying: Copy figure by joining dots.
Pursuit: Follow each line by eye and show where it ends, by
writing its number in the correct box at the right.

Reproduced by permission of the California Test Bureau.


Bennett Mechanical Comprehension Test*


Description. This is a paper-and-pencil test in which
comprehension of mechanical relations is determined by means of
pictures and sketches. The test is fairly advanced in difficulty. Each
picture or drawing has a simply phrased question designed to
reveal the examinee's understanding of the mechanical problem
presented. Figure 6-2 shows samples from the test battery.


Scope. There are four forms of the Bennett test. Form AA, the
easiest, is suitable for trade and high schools and for less well
trained workers. Form BB, more difficult, is for engineering
school applicants, technicians, and engineers. Form CC, the most
difficult, differentiates among examinees of high ability levels.
The fourth form, W1, is for women.


FIGURE 6-2 Samples from Bennett Mechanical Comprehension
Test

Which room has more of an echo?
Which would be better shears for cutting metal?
Which gear turns slower?
Which cart is more likely to tip over on the hillside?

Reproduced by permission of The Psychological Corporation.


Scoring. Percentile norms, which are supplied for each test
form, are applicable to a variety of student and occupational
groups. The test is valuable in guidance, in selecting applicants
with aptitude for mechanical thinking, and in the selection of
students wanting to study mechanics and engineering. The
MacQuarrie is a useful supplement to the Bennett test when speed
and manual dexterity are required as well as more abstract
thinking about mechanical relations.

The reliability of the Bennett is satisfactory. Validity is hard
to determine, but the test is valid in relation to such criteria as
grades in high-school shop courses and occupational and
industrial performance.


Clerical Aptitude 


Tests planned to gauge clerical aptitude are concerned mainly 
with perceptual speed and accuracy in reading, writing, and 
marking, and with manual dexterity and skill. Office workers 
are designated in several ways, such as general clerk, sales clerk, 
shipping clerk, filing clerk, typist, and receptionist. The jobs
differ in the kind and variety of their duties, but all demand (to
a greater or lesser extent) reading, writing, sorting, checking,
filing, folding, sealing, and stamping.

The present section will describe two tests of clerical aptitude,
the first fairly narrow in functions covered, the second
much broader.

Minnesota Clerical Test
General Clerical Test


Minnesota Clerical Test*

Description. This battery covers “speed and accuracy in
perceiving clerical detail.” There are two parts, number comparison
and name comparison. In the first, the examinee is shown two
hundred pairs of numbers each containing from 3 to 12 digits.
If the two numbers are alike, the examinee places a check (✓)
between them; if they are unlike, he leaves the space blank. In the
second test, proper names (which match or fail to match) are
substituted for number pairs. Samples are shown below:

70542 ______ 29524
5794367 __✓__ 5794367
John C. Linder ______ John C. Lender
Investors' Syndicate __✓__ Investors' Syndicate

The Minnesota Clerical Test is not designed to encompass all
the factors which make for proficiency in office work, but it
does attempt to predict ability to handle addresses, bills, accounts,
and so on. The Minnesota test has been found to have prognostic
value in the selection of clerks, packers, checkers, inspectors (of
products), and workers in other factory jobs.


Scope. This clerical test may be used with students from junior 
high school on and for adults. 


Scoring. The working time of the test is about fifteen minutes,
so that both speed and accuracy enter into a score. Individual
differences appear in the scores and must be taken into account
in interpreting the test. A very careful examinee, for example,
may make few errors but earn a relatively low score because of
slowness and over-cautiousness. On the other hand, a fast but
careless worker may mark more items but tend to make many
errors. Percentile norms are available for boys and girls, junior
and senior high-school students, and several groups of industrial
workers. Among the latter there are norms for women who are
machine operators, typists and clerks; for men who are tellers
(bank), accountants and various sorts of clerks. A high score
earned by a student does not necessarily mean that this examinee
will make a good clerical worker, though it is a decidedly good
omen. On the other hand, a high-school counselor would
certainly be wise to question the vocational promise of a
commercial and business student who scored below the twenty-fifth
percentile of clerical workers. The reliability of the test is high.


General Clerical Test (GCT)* 


Description. This test battery is designed to measure three kinds 
of aptitude judged to be valuable in office work. There are nine 
sub-tests in the battery. Parts I and II test clerical speed and 
accuracy; Parts III, IV, and V numerical ability; Parts VI, VII, 
VIII and IX verbal facility. The first two (checking and alpha- 
betizing) measure perceptual speed and accuracy as expressed in 
such activities as sorting, coding, and alphabetizing. The next 
three measure numerical aptitude as shown in computation, error 
location and arithmetical reasoning. The last four measure verbal
facility by means of spelling, reading comprehension,
vocabulary, and grammar. The over-all score is a good measure of
abstract intelligence, as well as of aptitude for clerical work.
The test is to be recommended, therefore, for clerical jobs which
demand a relatively high level of intelligence.


Scope. The battery is intended for use with high-school and
business school students. The GCT may also be valuable when
testing applicants for more responsible clerical positions. The
working time for the test is about fifty minutes.


Scoring. Percentile norms are available for high schools and for
business schools, as well as for various sorts of clerical workers.
Norms for each sub-test, as well as for total score, are provided.
The reliability of the whole test is high (greater than .90). The
reliability of the sub-tests is much lower, and the counselor must
be tentative in judgments based upon parts of the test.

* ... Corporation, New York, N. Y.


APTITUDE TESTS IN SPECIAL AREAS*


In this section five test batteries often useful to the educational
counselor and classroom teacher will be described. These
specialized examinations are illustrative of many tests in this field:

Differential Aptitude Tests
Minnesota Paper Form Board
Murphy-Durrell Diagnostic Reading Readiness Test
Orleans Algebra Prognosis Test
Turse Short-Hand Aptitude Test


Differential Aptitude Tests (DAT)**


Description. This battery is designed for educational and voca- 
tional guidance of high-school students. There are seven sub- 
tests, each of which yields a separate score: 

Verbal reasoning: A difficult verbal analogies test, which measures ability
to handle verbal relations. Aspirants for professions should earn high
scores.

Numerical ability: An arithmetic test covering a wide range of
operations. This test is an important predictor in science and engineering.

Abstract reasoning: A non-language test which demands the solution of
problems expressed in diagrams and figures. The test measures a high
level of abstract intelligence.

Space relations: Ability to perceive a three-dimensional object from a
two-dimensional pattern. Useful in engineering, architecture, and
drafting.

Clerical speed and accuracy: A test of speed and accuracy in the
performance of clerical tasks. Speed is an important factor.

Mechanical reasoning: A form of the Bennett Mechanical Comprehension
Test. Useful as a predictor of engineering aptitude when combined
with the first four tests above.

Language usage: Two tests scored separately which measure the ability
to spell and to locate errors in sentences. Emphasizes the mechanics
of language as compared with test 1, which emphasizes abstract
comprehension.

* ... often listed sensory-motor tests of visual and auditory
keenness, as well as special tests of motor skills, dexterity and co-ordination.
Apparatus tests of this sort are valuable in industry and the military service,
but they are not used routinely in the schools and will not be described here.
Some of the devices are very complex and require specialized training on the
part of the examiner. Oral Trade Tests constitute another sort of specialized
aptitude test which will not be treated here. These tests are really oral
interviews, are administered individually, and are valuable in appraising the
vocational training and work experience of an applicant.


FIGURE 6-3 Sample Items from the Differential Aptitude Tests

MECHANICAL REASONING
Which man in this picture has the heavier load?

CLERICAL SPEED AND ACCURACY
In each test item, one of the five combinations is underlined.
Find the same combination on the answer sheet and mark it.

LANGUAGE USAGE: I. Spelling
Indicate whether each word is spelled right or wrong.
Examples: W. man  X. gurl

LANGUAGE USAGE: II. Sentences
Decide which of the lettered parts of each sentence contains errors,
if any; mark the corresponding letters on the answer sheet.
Ain't we / going to the / office / next week / at all.
    A          B           C         D          E

Reproduced by permission of The Psychological Corporation.


The illustrative items in Figure 6-3 show the nature of the
sub-tests.

A main feature of the DAT is that the total score is broken
down into several components, so that from a student's profile
we have a record of comparative performance in eight
fundamental activities. The Manual gives explicit instructions for
administering and scoring the test battery. In addition, a Casebook
illustrates the use of the profile in diagnosis, and will be helpful
to guidance counselors.


Scope. For grade 8 and for high-school grades 9-12. 


Scoring. Percentile norms are supplied for grades 8 through 12 
for total score and for scores on each sub-test. Since there are 
large sex differences, percentile norms are given for boys and 
girls separately. Scaled scores (with a mean set at 50) are em- 
ployed in plotting the profiles. Figure 6-4 shows the profile of 
a boy who could profit from educational counseling. Note that
James is high in the space and mechanical tests, but mediocre to
low in all the others. The boy is certainly not “verbally minded,”
although he appears to have real talent in mechanics. The teacher 
will understand James better if he has his profile available. 

The DAT represents the modern practice of substituting a 
number of analytic scores (for example, on a profile) for a single 
over-all score. We have noted (page 113) that diagnosis of 
strong and weak points from short sub-tests is always precarious 
because of their low reliability. The reliability of the total DAT 
is very high, and the authors have increased the value of a diag- 
nosis from the sub-tests by computing the minimal difference 
between sub-test scores which will be significant, that is, non- 
chance. This makes it possible to say, for instance, that Roy’s 
score in abstract reasoning is significantly higher than his score in 
clerical speed and accuracy, or that Betty’s scores in verbal 
reasoning and numerical ability do not differ significantly.
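
One common way of arriving at such a minimal difference is sketched below. It is offered as an illustration and is not claimed to be the computation used by the authors of the DAT. It assumes that both sub-tests are expressed on the same scale with the same standard deviation, that the standard error of a difference is SD × sqrt(2 − r_a − r_b), and that a difference of 1.96 standard errors is taken as significant. The SD of 10, the reliabilities, the pupil's scores, and the function names are all hypothetical.

```python
import math


def se_of_difference(sd, reliability_a, reliability_b):
    """SE of the difference between two scores on a common scale.

    Uses SD * sqrt(2 - r_a - r_b), which combines the two standard
    errors of measurement.
    """
    return sd * math.sqrt(2.0 - reliability_a - reliability_b)


def minimal_significant_difference(sd, reliability_a, reliability_b, z=1.96):
    """Smallest score difference unlikely to be chance (z=1.96: 5 percent level)."""
    return z * se_of_difference(sd, reliability_a, reliability_b)


# Hypothetical values: scaled scores with SD 10, sub-test reliabilities
# of .90 and .86.
needed = minimal_significant_difference(10.0, 0.90, 0.86)
print(round(needed, 1))

# Hypothetical pupil: abstract reasoning 62, clerical speed and accuracy 48.
print(abs(62 - 48) > needed)
```

With these invented figures a gap of about ten scaled-score points is needed before two sub-test scores can be said to differ; smaller gaps on a profile are best read as no difference at all.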

Despite its general excellence, the DAT has some practical
drawbacks to its use in schools. For one thing, the battery is
long (working time approximately three hours) and the cost
relatively high. Good norms are available for the high-school
grades (boys and girls taken separately), but there are relatively
few data on occupational and vocational groups. The battery
appears to have content validity, and various experimental studies
indicate that it possesses empirical validity. For example, workers
in the electrical, mechanical, and building trades score above
average on mechanical reasoning, and clerks are about average
in numerical ability and in clerical speed and accuracy and
language usage. Engineering students score very high on all the
sub-tests except the clerical tests, but are above the mean here. Men
in the skilled trades (baker, butcher) are average in mechanical
reasoning, and low in numerical ability, abstract reasoning, and
space relations. Pre-medical students score high on all sub-tests,
and especially high in verbal reasoning, numerical ability, and
sentences. In the high school, verbal reasoning and sentences are
predictive of grades in English; numerical ability, verbal
reasoning, and abstract reasoning show substantial correlations with
mathematics and science. Unfortunately, the data do not reveal
how successful a man is likely to be over a period of time in
a profession, occupation, or trade. But the tests often provide
significant clues.


FIGURE 6-4 Profile of a High-School Boy on the Differential
Aptitude Tests

[Profile chart: Differential Aptitude Tests, G. K. Bennett, H. G. Seashore,
and A. G. Wesman; The Psychological Corporation.]

Reproduced by permission of The Psychological Corporation.


Minnesota Paper Form Board (MPFB)* 


Description. This is a well-known paper-and-pencil test dealing 
with spatial relations. It represents an effort to put a formboard 
on paper. Sample items are shown in Figure 6-5. 

Each test item presents a geometrical figure cut into two or 
more parts. The examinee is to decide how the parts would look if 
fitted into a complete figure; he does this by selecting the draw- 
ing which shows the correct arrangement. Studies have shown 
the Minnesota Paper Form Board to be a good index of ability 
to perceive spatial relations and to manipulate figures in two 
dimensions. The test is useful as an aid in predicting success in 
shop work, grades in technical courses, in dentistry, art work, 
and shop and factory output. It does not tap the more intellectual 
aspects of engineering—for instance, the ability to use symbols 
in solving problems. But it does test one component in engineer- 
ing skill. A boy scoring high in the MPFB is not necessarily apt 
in engineering, dentistry, or art, but he has promise and is worth 
further examination. On the other hand, a boy who scores low 
had best be encouraged to try some other kind of work. As 
often happens, we can give negative educational and vocational 
advice with far greater assurance than we can offer a positive
course of action. Thus, we can tell a youngster that he had
better not attempt engineering, but we cannot always offer him
specific advice as to just what he should do.


FIGURE 6-5 Sample Items from the Minnesota Paper Form
Board Test

Directions: For each item choose the figure which would result if the
pieces in the first section were assembled.

Reproduced by permission of The Psychological Corporation.


Scope. Grade 7 and above. 


Scoring. Norms are available for school grades and for various
occupational groups. There are two forms of the MPFB, and
the test is easy to administer and score. Counselors have found
the test useful as a supplement to verbal intelligence and
achievement tests, especially for students planning to study
architecture, engineering, commercial art, and other vocations
requiring spatial perception and visualization.


Murphy-Durrell Diagnostic Reading Readiness Test* 


Description. This test has been designed to measure three
characteristics believed to be important in the acquisition of
reading skills: auditory discrimination, visual discrimination,
and learning rate. Like other readiness tests it is prognostic, in
that it forecasts whether or not a child is ready to begin reading.
It is also an achievement test and could be so classified, as it
measures the educational maturity of a youngster. The Metropolitan
Readiness Tests (page 118) may, in turn, be classified as aptitude
tests rather than as achievement tests.


* Published by the World Book Company, Yonkers, N. Y.

The Murphy-Durrell test provides useful information for the 
first-grade teacher in deciding when to start a formal reading 
program and what outcomes to expect. At the same time, a good 
intelligence test will be useful in estimating general mental 
maturity.


Scope. Early in the First Grade or before. 


Scoring. Test 1 and Test 2 (auditory discrimination and visual 
discrimination) require about an hour each. Test 3 (learning) is 
both an individual and a group test; there are twenty minutes for
group instruction and three brief individual periods. Obtained 
raw scores are converted into percentile norms for Tests 1 and 2; 
ratings are used in Test 3. 


Orleans Algebra Prognosis Test (Rev.)* 


Description. This is a prognostic test, the purpose of which is to 
determine whether a pupil is likely to succeed in (is ready for) 
algebra. The test is administered before the pupil undertakes the 
study of algebra. There are nine parts, consisting of simple
lessons covering some aspect of algebra—for example, use of 
symbols, substitution in equations, literal nomenclature, and 
solving of problems, followed by tests on the material presented. 
An arithmetic test and a summary test of the material are in- 
cluded. The test has been shown to have good prognostic value, 
as indicated by its correlations with algebra grades and achieve- 
ment test scores in algebra. 


Scope. For students planning to study algebra.


Scoring. The test requires about forty-five to fifty minutes.
There are percentile norms corresponding to point scores.
Further, there are expectancy charts predicting how well a child
making a certain score can be expected to do in algebra. The
reliability of the test is satisfactory.


Turse Shorthand Aptitude Test* 


Description. This test is illustrative of aptitude tests developed 
for use with commercial and vocational subjects. The purpose of 
the test is to determine whether an examinee is likely to be 
successful in learning shorthand. There are seven sub-tests: strok- 
ing, spelling, phonetic association, symbol transcription, word 
discrimination, dictation, and word sense. 


Scope. For students planning to study shorthand. 


Scoring. There are percentile norms for students beginning the
study of shorthand. The Turse test is correlated with achieve- 
ment in shorthand and is valuable in prognosis. The working 
time of the test is about an hour. 


APTITUDE TESTS FOR THE PROFESSIONS 


Tests of aptitude for the professions are primarily achievement 
tests designed to forecast a student’s chances of success in train- 
ing for medicine, law, or engineering. These tests are specialized 
in content and are essentially work samples in the designated 
field. Professional aptitude batteries are validated against grades 
in courses. It is not known precisely just how predictive these 
tests are of success in the actual practice of a profession, but 
there is some evidence that such aptitude tests are related
(sometimes highly related) to later success.

The classroom teacher should be familiar with the general
content and purpose of the professional aptitude tests, though
he will rarely be called on to administer or score them. These
batteries are not generally available, are often part of a testing
program, are highly specialized, and are usually scored and 
interpreted in a testing center. We shall, accordingly, give a less 
detailed description of them. 


Medical College Admission Test* 


Description. This test consists of four parts: verbal, quantitative, 
understanding modern society, and science. The verbal section 
includes tests of vocabulary, and reading comprehension tests 
in science, social studies, and the humanities. The quantitative 
part requires that the examinee solve problems making use of 
numbers and symbols. The “understanding society” section is a 
multiple-choice examination covering current social, economic, 
and political affairs. The science part of the test contains ques- 
tions drawn from pre-medical courses in biology, chemistry, and 
physics. Samples from the various sections reveal the character 
of the examination. 


Verbal section 


sporadic: (A) immediate, (B) regular, (C) occasional, (D) alternate,
(E) replete


Quantitative part

12. One-fifth of a batch of 2000 radio tubes were defective. If
one-fourth of the first 1000 were defective, what fraction of the
second 1000 were defective?
(A) 1/20 (B) 1/10 (C) 3/20 (D) 9/40 (E) 3/10


Understanding society 

18. Which of the following was the primary objective of the 
nations which signed the North Atlantic Pact? 
(A) To form an alliance for military conquest. 
(B) To insure economic stability in democratic states. 
(C) To replace the Marshall Plan with a new alliance. 
(D) To destroy the effectiveness of the Soviet veto in the
United Nations.
(E) To unite for collective defense.


Science

21. A sodium atom and a sodium ion
(A) contain the same number of electrons
(B) contain the same number of protons
(C) have the same chemical properties
(D) have the same physical properties
(E) have different atomic numbers


The first three parts of the battery are related directly to 
standing in medical school. The “understanding society” section 
is not related to medical knowledge, but is included in an 
attempt to select candidates for medicine who will be successful 
in adapting to the needs of the time. In their instructions to candi- 
dates, the authors write that “the test is intended to complement 
other data (your total college record, interviews, references and 
recommendations) with an objective inventory of your skills, 
concepts and information . . . acquired from formal study and 
from experience."* 


Law School Admission Test** 


Description. This battery is designed for use in selecting the 
best candidates from among those applying for law school. The 
battery has six parts: principles and cases, data interpretation, 
reading comprehension, debates (the examinee determines 
whether a statement supports, refutes, or is irrelevant to a given
resolution), best arguments, and paragraph reading. Some of the
material is difficult. The test battery has a correlation of about .50
with law school grades. When combined with college marks, 
it is highly predictive of success in law school. 


Pre-Engineering Ability Test**


Description. This test consists of two sorts of material: (a)
comprehension of scientific materials, and (b) general
mathematical problems designed to measure competence in this area.
The first part of the test involves reading scientific prose, tables
and graphs, and answering questions based upon these materials.
The second part consists of problems in arithmetic, algebra, and
geometry. The Pre-Engineering Test correlates about .50 with
grades in the first term of engineering school. The reliability
of the battery is high.


* Medical College Admission Test, Bulletin of Information, Educational
Testing Service, Princeton, N. J., 1957, p. 22.
** Published by the Educational Testing Service, Princeton, N. J.


National Teacher Examination 


Description. These examinations are planned for use by school 
systems as an aid in the selection of teachers, and they are em- 
ployed also by teacher-training colleges as a means of evaluating 
their students. The examinations are constructed, administered, 
and scored by the Educational Testing Service. Their objective 
is the measurement of professional background, general
intelligence, and general culture. The battery has two parts:
four common examinations and a series of optional examinations.
The first set covers a student's general background for teaching, 
and the second his mastery of some special field. 

The common examinations comprise the following sub-tests:

Professional information: Child development, educational psychology,
guidance, measurement, principles and methods of teaching.

General culture: Sections on science and mathematics and on literature,
history and the fine arts. Examinations cover the development and
current state of affairs in these fields.

English expression: Grammatical errors to be detected in sentences.

Non-verbal reasoning: A pattern completion test in which the examinee
must fathom the relationships in a given figure and choose the correct
figure to complete the pattern.

The optional examinations cover eight areas of specialization: 
education in elementary schools, early child education, biological 
sciences, English, industrial arts, mathematics, physical sciences, 
and social studies. The four common examinations have exhibited 
substantial relationships with ratings for effectiveness of teachers 
by supervisors. The tests do not attempt to measure personality
factors, interest, or drive.


TESTS OF ARTISTIC APTITUDE OR TALENT


Tests in this area are concerned with finding whether an 
examinee possesses some of the factors which appear to be
necessary for success in music or in art. So many traits contribute
to the success of an artist or a musician that it is impossible for
an aptitude test to do more than tap some of the more obvious
components. Perhaps the best the aptitude tests can do in many
instances is to aid the counselor in steering away from the arts those
aspiring students who have no real talent and whose money and 
time might be better spent in other pursuits. 

It is doubtful whether the classroom teacher will have the 
time, the training, or the equipment needed to administer and 
interpret the aptitude tests in this area. Teachers engaged in 
guidance should be familiar with such tests, however—with what 
they are and what they are trying to do. Two tests of music 
and one of art will be described in this section. They are
representative of aptitude measures in this field.


Seashore Measures of Musical Talents
Diagnostic Tests of Achievement in Music
Meier Art Judgment Test


Seashore Measures of Musical Talents* 

Description. This is a test of “ear for music.” The test battery
consists of six separate tests covering such attributes of tone as
pitch discrimination, loudness, rhythm, time, timbre, and tonal
memory. The tests are given by means of phonograph records.
Each test item or problem presents a pair of tones or a tonal
sequence. In the second playing, one of the tones is changed,
or the sequence of tones is altered in some way. In the pitch
discrimination test, the examinee marks on a test sheet whether
the second tone is higher (H) or lower (L) than the first.
Comparisons become progressively more difficult as the difference
in pitch between the two tones decreases. In the time and
loudness tests, the second, or comparison, tone differs in strength
or in pattern from the first. The rhythm test requires the
examinee to decide whether the second of two patterns is the same
as or different from the first. The timbre and tonal memory tests
differ somewhat from the others. In the first, two tonal stimuli
are compared for quality (consonance); in the second, a short
series of from three to five tones is played, and then played a
second time with one note changed. The subject writes
down the number of the altered note. The stimuli (tones) on
the phonograph records are as pure (uncomplicated) as possible.


* Published by The Psychological Corporation, New York, N. Y.


Scope. The Seashore Tests are applicable from the fifth grade
on.


Scoring. Scores from the six sub-tests are plotted on a profile
to give a graphic representation of performance. Percentile norms
are available for fifth and sixth graders, seventh and eighth
graders, and adults. The Seashore Tests have been used in schools of
music and in music courses in academic schools. The tests
admittedly do not run the gamut of musical talent, but they do
measure important aspects of musical aptitude. A child who ranks
low on these tests has a poor ear for music and is a doubtful
selection for extensive musical training. The reliability of the
battery runs about .80.


Diagnostic Tests of Achievement in Music* 


Description. As the name implies, this test battery is designed 
to find how well students have acquired the theory and tech- 
nical knowledge needed to read and understand music. The 
test consists of ten parts: diatonic syllable names, chromatic 
syllable names, number names, time signatures, major and minor 
keys, note and rest values, letter names, signs and symbols, key 
names, and song recognition. Test content is based on materials
recommended by musical authorities as fundamental in musical
education. A piano is required for the tests.


* Published by the California Test Bureau, Los Angeles, Calif.


Scope. For grades 4 through 12. Test items are graded up 
sharply in difficulty.


Scoring. Norms for the test are based on the degree of mastery 
shown by students for the various sorts of material. Strengths and 
weaknesses are revealed by comparison of scores upon the ten 
parts of the test. The reliability of the whole test over the rather 
wide range for which it is applicable is very high. Reliability for 
separate grades is lower, but probably satisfactory. Working 
time for the test is about sixty minutes. The Diagnostic Tests 
are a useful supplement to "ear" tests like the Seashore. An ear 
for music is necessary for any musical activity, whereas a knowl- 
edge of the technical aspects of music is necessary for one 
aspiring to be a musician. 


Meier Art Judgment Test* 


Description. This test consists of a hundred problems in each of 
which an artistic judgment is demanded. Each test item is
presented in two versions. In the first version, there is a painting or
drawing by some well-known artist, or an acknowledged artistic
design; in the second, the same theme is presented but in altered
form, the change being in symmetry, balance, unity, or rhythm.
All pictures are in black and white, so that no complication is 
introduced (nor any clues) by color. The examinee is told that 
the two versions of the picture differ and is asked to select the 
better version. The test is, accordingly, a measure of aesthetic 
judgment, the criterion being the consensus of experts in art. 


See Figure 6-6 (facing page 150).


Scope. The Meier test is intended for junior and senior high
schools, as well as colleges and art schools.


Scoring. Norms are for students in art courses. High scores do
not necessarily mean that the student is destined to be an artist.
But a low score should be a warning signal to one planning a
career in art. The Meier test has correlations of from about .45
to .50 with grades in art courses, but low correlation with scores
on verbal intelligence tests. This does not mean that artists are
unintelligent, but that many factors besides abstract ability must
enter into artistic appreciation. The reliability of the test is
about .75 for fairly homogeneous groups.


* Published by the Bureau of Educational Research and Service,
University of Iowa, Iowa City, Iowa.


HOW TO JUDGE AN APTITUDE TEST


Like tests of intelligence and of educational achievement, apti- 
tude tests must be judged by the adequacy of their validation, 
reliability, scaling, and norms. Various comments concerning
these aspects of the tests described in this chapter have been made
in appropriate places. This and other material will now be 
summarized. 


Validity. Aptitude tests generally possess content validity. Tests 
of speed, dexterity, seeing mechanical relations, solving mechani- 
cal problems, and the like seem proper for measuring mechanical 
aptitude. Moreover, tests of sorting, writing, reading, and alpha- 
betizing appear to be appropriate for assessing clerical ability. 
In the tests of professional aptitude and in those of talent, the 
content has been chosen with a view toward forecasting per- 
formance in school and (hopefully) in life. 

Aptitude tests have been validated against various criteria, in- 
cluding grades in courses and success in vocations or trades. Such 
measures of practical or working validity have been determined 
for the Minnesota Paper Form Board, the MacQuarrie, and the
DAT.


Reliability. The reliability of the standard aptitude tests is 
generally satisfactory, and we can have confidence in the stability 
of a score. In some cases, the standard error of a score, as well 
as the reliability coefficient, is given by the author of a test.


Scaling. In most aptitude tests, raw scores are converted into
percentile ranks. In some tests (the DAT, for example) scaled
scores are used on the profile in making comparisons of a given
student's scores.
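
The conversion of a raw score into a percentile rank can be sketched in a few lines of Python. This is only an illustration and not the procedure of any particular test manual: the function name `percentile_rank`, the norm-group scores, and the convention of counting half of the tied scores as falling below the given score are all assumptions made for the example.

```python
def percentile_rank(raw_score, norm_scores):
    """Per cent of the norm group falling below raw_score.

    Scores tied with raw_score are counted as half below and half
    above (an assumed convention; test manuals may differ).
    """
    if not norm_scores:
        raise ValueError("norm group must not be empty")
    below = sum(1 for s in norm_scores if s < raw_score)
    tied = sum(1 for s in norm_scores if s == raw_score)
    return 100.0 * (below + 0.5 * tied) / len(norm_scores)


# Hypothetical raw scores for a norm group of ten students.
norm_group = [10, 12, 12, 15, 18, 20, 20, 20, 25, 30]

if __name__ == "__main__":
    for raw in (12, 20, 30):
        print(raw, percentile_rank(raw, norm_group))
```

With this made-up norm group, a raw score of 20 falls at the 65th percentile: five scores lie below it and three are tied with it.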


Norms. All aptitude tests have norms either in percentiles or in 
scaled scores. A few tests give norms for certain occupational
groups. One drawback to the use of vocational aptitude tests,
however, is the lack of adequate norms in many job areas. There
is a need for information regarding the predictive value of
professional and vocational tests for persons long out of school.
It would be a great advance if we knew how well an aptitude
test could forecast the success of engineers or lawyers, for
example, and not simply grades in courses.


SUGGESTIONS FOR READINGS 


Anastasi, A. Psychological Testing. New York: Macmillan, 1954. 

Cronbach, L. J. Essentials of Psychological Testing. New York:
Harper, 1949.

Greene, E. B. Measurements of Human Behavior (Rev. edition). New
York: Odyssey Press, 1952.

Noll, V. H. Introduction to Educational Measurement. Boston:
Houghton Mifflin, 1957.


SUGGESTIONS FOR LABORATORY WORK 


1. Administer several standardized aptitude tests to the class. Cut the
time allowance if necessary. Students should score their own tests and
plot profiles when called for.

2. Find in the Manual the specifications which the author lays down 
for his aptitude test. Examine the items of the test. Do you agree that 
the test has content validity? Are any data given on experimental 
validity? 

3. Make a study of the Differential Aptitude Tests. Analyze the battery 
for validity, reliability, scaling procedures, and norms.


QUESTIONS FOR DISCUSSION


1. How do aptitude tests differ from readiness tests in purpose and
content?

2. Why does a test battery like the Minnesota Clerical Test vary
greatly in the accuracy of its forecasts of office work?

3. Why are aptitude tests used more often in high schools than in
elementary schools?

4. Give some reasons why paper-and-pencil tests of mechanical
ability are as useful as are work samples in determining aptitude.

5. How would you set up a program for selecting candidates for a 
nursing school? Outline the procedures you would use. 

6. Would you use the Bennett Mechanical Comprehension Test to 
select workers in an automobile factory? 

7. It has been said that the best measure of aptitude for mathematics 
(or for any subject) is the achievement to date. Do you agree? 

8. How could you discover whether the Meier Art Test is measuring 
native artistic ability and not training in art? 

9. A girl of 16 scores very high on the Seashore Music Test. Would 
you advise her to undertake a career in music? Why or why not? What 
else might you need to know about her? 

10. How can a follow-up study of graduates of law and medical schools 
be useful to a counselor using a professional aptitude test? 




CHAPTER 7


PERSONALITY TESTS 


In previous chapters, we have indicated on several occasions 
that prediction of success based upon measures of intelligence, 
school achievement and aptitude must always be qualified by the 
statement "provided the personality traits are favorable." In the 
present chapter, we shall attempt to see how well we can deter- 
mine favorable and unfavorable personality traits. 

There are a number of descriptions of personality, and the 
usefulness of any definition will depend in most cases upon the 
purposes of the author. For the teacher or school counselor, a 
practical working definition is to the effect that personality is a 
student's characteristic way of doing things. Suppose that two 
boys, John and Jim, are about the same age, have about the same
IQ, and do about the same caliber of school work. But suppose
further that John is friendly, highly motivated, and likeable; that
Jim is sullen, indifferent in attitude, and generally avoided by
teachers and classmates. The decided contrast in the behavior of
these two boys arises from their distinctive personality traits,
not from differences in mental ability. Failure in school,
or in life, is often the result of a person's inability to make good
use of his potential personal assets. Obviously, lack of success
may arise either from negative (unpleasant) personality traits or
from failure to make use of positive (pleasant) personality traits. 

The psychologist has attempted to evaluate personality traits 
in three ways: (1) by rating scales, (2) by questionnaires and
inventories, and (3) by what are called “projective tests.” The first
two approaches are the more readily applicable in the schools.
Projective tests should be employed only by clinical
psychologists and psychiatrists, since they require special training for
their administration and interpretation. These “tests” are essen- 
tially clinical instruments and are most often used diagnostically in 
cases of severe personality disorders or in behavior problems where 
drastic emotional disturbances are suspected. Rating scales and 
inventories, on the other hand, can be administered and inter- 
preted in a useful way by teachers and counselors. 


RATING SCALES 


The rating scale is a device for obtaining judgments of the 
degree to which an individual possesses certain behavior traits and 
attributes not readily detectable by objective tests. In the school 
situation, rating scales provide appraisals of a teacher (or of a 
candidate for a teaching position) in several characteristics. Rat- 
ings by teachers or principals are often required for students 
seeking entrance to college or looking for a job. In making a 
rating, the judge expresses his opinion by marking along a grad- 
uated scale or by checking in the category which he feels best 
describes the person being rated. 




FIGURE 7-1 Sample Items from Various Graphic Rating Scales


1. From a graphic rating scale for clerical workers:
Accuracy: Consider carefully quality of work, freedom from error.

   no errors | very careful | few errors | careless | many errors


2. From a behavior rating scale for children:
Is his attention sustained?

   (5) Distracted: jumps rapidly from one thing to another.
   (4) Difficult to keep at a task until completed.
   (3) Attends adequately.
   (2) Is absorbed in what he does.
   (1) Able to hold attention for long periods.


3. From the American Council on Education Rating Scale for
prospective college students:
Does he get others to do what he wishes?

   Probably unable to lead his fellows.
   Lets others take lead.
   Sometimes leads in minor affairs.
   Sometimes leads in important affairs.
   Displays marked ability to lead his fellows; makes things go.


4. From a rating device for teacher candidates:
(Put a check under the appropriate heading)

   Tact: Very inferior | Inferior | Average | Superior | Very superior


5. From a rating scale for teachers:
(Circle the number which best indicates the degree or extent to
which the qualities are practiced.)
0 = unsatisfactory; 1 = below average; 2 = average;
3 = above average; 4 = superior.

Emotional maturity:                                    0  1  2  3  4
To what extent does the teacher exhibit desirable
balance between emotional responsiveness and emotional
control? Consider disposition, sense of humor, restraint
and thoughtfulness in dealing with others, feelings of
security, objectivity of interest, freedom from excessive
fears and worries and warmth of feeling and expression.


6. From a rating scale for officer candidates:
Relations with fellow candidates:

   Uncooperative
   Grudgingly cooperative
   Cooperates
   Cooperates and contributes
   Leads and willingly cooperates. Good ideas.


Perhaps the most useful type of rating device is the Graphic
Rating Scale or some variation of it. The typical graphic scale
consists of a straight line, for example, five inches long, which is
taken to represent the range of behavior in the trait. In lieu of a 
line, several categories representing gradations in the trait may 
be provided. The illustrations in Figure 7-1 are samples from 
various rating scales. 

Units on the graphic rating scale are represented by the suc- 
cessive scale divisions, but the directions make it clear that the 
check mark indicating the judgment may be placed anywhere 
along the scale line. A graphic scale is often scored by separating
the scale line into one hundred points. A person's rating is then 
determined by the distance of the judge's check from the low 
end of the scale. A more summary method is also used: if there 
are five main divisions on the scale, the highest division may be 
designated “1,” the next division “2,” and so on down. 
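
The two scoring methods just described can be sketched in Python. The function names, the five-inch default, and the rule that a check falling exactly on a boundary is credited to the higher division are assumptions made for illustration; an actual scale's directions would settle such details.

```python
def points_score(distance_from_low_end, line_length=5.0, points=100):
    """Score a check mark by its distance from the low end of the line."""
    if not 0 <= distance_from_low_end <= line_length:
        raise ValueError("check mark must lie on the scale line")
    return round(points * distance_from_low_end / line_length)


def division_number(distance_from_low_end, line_length=5.0, divisions=5):
    """Summary method: the highest division is 1, the next 2, and so on."""
    if not 0 <= distance_from_low_end <= line_length:
        raise ValueError("check mark must lie on the scale line")
    # Index of the division counted from the low end (0 = lowest).
    # A check exactly on a boundary goes to the higher division (assumed).
    from_low = min(int(divisions * distance_from_low_end / line_length),
                   divisions - 1)
    return divisions - from_low


if __name__ == "__main__":
    # A check placed 3.2 inches from the low end of a five-inch line.
    print(points_score(3.2), division_number(3.2))
```

A check 3.2 inches from the low end of a five-inch line thus earns 64 points on the hundred-point method and falls in division 2 on the summary method.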


Requirements of a Good Rating Scale 


A good rating scale should satisfy the following requirements. 

Traits should be carefully defined. A sentence or a phrase is 
more informative than a single word. Thus, the meaning of “tact,” 
"initiative," or “dependability” should be pinned down by 
descriptive phrases or by actual examples. Intelligence, energy 
level, personal appearance, work habits, and the like are better
judged than are character traits (loyalty, courage, unselfishness) 
because they are more readily observed in social behavior. Char- 
acter traits must be inferred from a variety of behaviors. It often 
helps to clarify a rating if the judge is required to record specific 
instances (“behaviorgrams”) of the trait which support his 
opinion. On the ACE scale, for example, space is left in which 
the judge may provide observations which justify his rating. 

A good scale avoids terms which are hodgepodges of a number
of activities, for example, “standing in the community,” “social
position,” or “moral qualities.” By the same token, a good scale
avoids narrow, specific terms. The dean or principal rarely knows
intimate details about a teacher: whether he sings in the choir,
loves his mother, or plays golf well. Information of this sort is
often called for, however. Judges do not often have occasion to
observe or to learn about personal behavior and, in general,
should not be asked to supply such information. 

The number of divisions on the scale should be neither too 
numerous nor too few. The optimum number of divisions on a 
graphic scale is perhaps five to seven. Fewer than five divisions
make the groupings too coarse; more than seven divisions
demand fractionings of the trait which are too fine for most
raters, with the result that a large part of the scale may be unused.
A five-division scale is popular, since it corresponds to the mark- 
ing system A, B, C, D, and E. Furthermore, the five categories 
“high,” “above average,” “average,” “below average,” and 
“poor” seem to mark off fairly natural divisions. 

Directions to the rater should be explicit. The adequacy of the 
directions given the rater will have a substantial effect on the 
validity and reliability of his ratings. The rater (1) should be 
given as explicit directions as possible, (2) should be told what 
is meant by the distribution of a trait, and (3) should be warned 
against assigning too many “average” ratings. This last is some- 
times needed when the persons to be rated are not well known 
to the rater, when the meaning of the traits is not well under- 
stood, and when the rater is overcautious. Raters must be warned 
against the “halo effect” and the tendency to see logical relations
among traits, that is, to assume, for example, that intelligence and
moral behavior or intelligence and good work habits of necessity go
together.

As for the distribution of traits, the best first hypothesis (in lieu 
of other information) is to assume that ratings will be distributed 
in the form of a normal curve. When the baseline of the normal 
curve is subdivided into five equal parts, the percentages in the
divisions (reading from either end of the curve) are 7, 24, 38,
24, and 7. If there are seven divisions on the scale, the per-
centages in the subdivisions are 4, 10, 22, 28, 22, 10, and 4. The direc-
tions should make it clear that the exact proportions in the normal
curve should not be followed slavishly. But the point should be
stressed that when a group of teachers is rated for “skill in instruc-
tion” on a five-division scale, we must expect many more in the
middle of the scale than at either end. 
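
The percentages quoted above can be checked with a short computation. The sketch below is only an illustration: the text does not say where the baseline of the curve is cut off, and the assumption made here (a baseline running from 2.5 standard deviations below the mean to 2.5 above it, with the two end divisions taking in the small tails beyond those limits) is simply one that reproduces the figures given. The function names are hypothetical.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Proportion of the standard normal curve lying below z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def division_percentages(k, half_range=2.5):
    """Percent of cases in each of k equal divisions of the baseline.

    half_range is an ASSUMPTION (not stated in the text): the baseline
    runs from -half_range to +half_range sigma, and the two end
    divisions absorb the tails beyond those limits.
    """
    width = 2.0 * half_range / k
    # Interior cut points between neighboring divisions.
    cuts = [-half_range + i * width for i in range(1, k)]
    cum = [0.0] + [normal_cdf(c) for c in cuts] + [1.0]
    return [round(100.0 * (hi - lo)) for lo, hi in zip(cum, cum[1:])]

print(division_percentages(5))  # [7, 24, 38, 24, 7]
print(division_percentages(7))  # [4, 10, 22, 28, 22, 10, 4]
```

A different choice of baseline limits would give somewhat different percentages, which is one more reason the proportions should be treated as a guide rather than a quota.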

The tendency to assign too many “average” ratings (the “cau- 
tion factor") is to be contrasted with the tendency to assign too 
many high ratings (the "generosity factor"). Stress on the 
normal distribution of most traits will help correct this tendency. 
If the low end of the rating scale is described by such unpleasant 
terms as "stingy," "stupid," "mean," this part of the rating line 
may be avoided by many raters. The “halo effect" mentioned 
above is the tendency to rate a person high on all traits if he is
well liked or is regarded as highly intelligent. Conversely, if a 
person is disliked, there is a tendency for him to be rated low on 
all traits. To minimize halo, the rater is usually told to rate all 
candidates on a single trait, then to rate all on a second trait, and 
so on. This procedure is impractical, of course, when the rater 
is called on to consider one person at a time and rate him on, say, 
ten traits. We are often forced, therefore, to resort to warnings 
against halo and careful definition of traits. 


Validity and Reliability of a Rating Scale 


Ratings for intelligence and for special aptitudes made on a 
graphic rating scale can be validated against objective test scores. 
But ratings of personality traits cannot be so validated, since we 
rarely have criterion scores and must perforce fall back on a 
consensus of judges. In ratings of personality traits, validity and 
reliability mean virtually the same thing. If three or more raters 
agree that Brown is a skilled worker or a friendly person, the 
average of these ratings is reliable (consistent) and valid (worthy 
of confidence). If two supervisors decide independently that 
Miss Miller has a pleasing voice and sympathetic manner, our 
confidence in these judgments is greater than if only one super-
visor had so stated. In general, confidence increases with the
number of agreements when ratings are made independently. 

It can be shown that if the estimates of two judges correlate 
.60, then the average of these ratings will correlate .75 with those 
of two equally good judges. It is, of course, difficult to decide 
when two judges are “equally good.” We can never guarantee 
this to be true, but we can (a) select judges who at least know
the ratees well, (b) provide careful definitions of the traits to be 
rated, and (c) allow for individual differences in rating standards. 
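
The figure of .75 is what the Spearman-Brown formula gives for two judges whose single ratings correlate .60. The text does not name the formula, so the sketch below rests on the assumption that this is the relation intended; the function name is hypothetical.

```python
def pooled_reliability(r, n):
    """Spearman-Brown estimate of the correlation between the average
    of n judges and the average of n other, equally good judges, given
    the correlation r between any two single judges."""
    return n * r / (1.0 + (n - 1) * r)

# Two judges whose ratings correlate .60, as in the text.
print(round(pooled_reliability(0.60, 2), 2))  # 0.75

# Adding judges raises the figure further, under the same assumptions.
print(round(pooled_reliability(0.60, 3), 2))  # 0.82
```

The estimate holds only so far as the judges really are "equally good," which is the condition the text warns can never be guaranteed.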


Summary on Rating Scales 


Ratings from graphic scales will generally deserve confidence 
when: 


1. Qualities which can be observed in behavior are rated. 
Energy, appearance, and teaching skill are better rated than 
are character and moral traits. 

2. Characteristics to be rated are illustrated. The use of behavior- 
grams (page 160) and instances will strengthen the ratings. 

3. Raters have actually observed the persons to be rated in 
situations where personality might be revealed.

4. Independent ratings are pooled. 

5. Judges are confident that the ratings are valuable. 

6. Different standards are accounted for by explicit directions 
or by statistical techniques. 
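
Rules 4 and 6 can be illustrated together. One statistical technique for allowing for different standards (offered here as an example, not as the method the text has in mind) is to express each rater's ratings as standard scores before pooling them. The ratings below are invented for the illustration, and the sketch assumes every rater has rated the same persons and has not given everyone the same rating.

```python
from statistics import mean, pstdev

def standardize(ratings):
    """Convert one rater's ratings to standard scores (mean 0, SD 1)."""
    m, s = mean(ratings), pstdev(ratings)
    return [(x - m) / s for x in ratings]

def pool(ratings_by_rater):
    """Average the standardized ratings each person received."""
    z = [standardize(r) for r in ratings_by_rater]
    return [mean(person) for person in zip(*z)]

# Hypothetical ratings of the same four persons by two raters,
# one lenient and one strict.
lenient = [5, 4, 4, 3]
strict = [3, 2, 2, 1]
pooled = pool([lenient, strict])
print([round(p, 2) for p in pooled])  # [1.41, 0.0, 0.0, -1.41]
```

Although the lenient rater's numbers run two points higher throughout, the two raters order the four persons identically, and the pooled standard scores reflect that agreement rather than the difference in standards.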

The above rules are perhaps most useful when one has the 
problem of constructing a rating device; and they may not seem 
to be very helpful to the teacher who is faced by a ready-made 
scale. Teachers and supervisors rarely have the responsibility for
devising a rating scale. But teachers are rated by supervisors and
supervisors by principals. Moreover, students are rated by
teachers and by principals for personality traits judged important
by colleges or prospective employers. Hence, the teacher should
be familiar with how the rating scale is put together and how
it works. Raters can improve a rating device by offering crit-
icisms and commendations. Eventually comments of this sort
should lead to a better scale.


QUESTIONNAIRES AND PERSONALITY 
INVENTORIES 


The behavior inventory calls for short answers to a standard 
set of questions believed to bring out personality characteristics. 
The inventory or questionnaire is a formal interview or self- 
report rather than a “mental test"; the examinee is not required 
to solve problems, but is asked to express opinions, preferences,
and feelings. Questionnaires have been developed by psychol- 
ogists for use in three main areas: (a) personal-social behavior, 
(b) attitudes, and (c) interests. The personal data sheet or per- 
sonality inventory deals with motives and needs, as well as with 
emotional and social factors. Adjustment to life—or more accu- 
rately, perhaps, maladjustment—is revealed by a person’s self- 
report of his worries, fears, feelings of insecurity or of depression, 
frustrations, lack of confidence and the like. The typical attitude
inventory canvasses the examinee’s feelings, opinions, and beliefs
about various institutions (for example, the church) and about
social and political matters (for example, war, freedom of speech 
and internationalism). Finally, the interest questionnaire deals 
with preferences for occupations, people, school subjects (such 
as physics or history), books, sports, hobbies, and avocations. 

A personality inventory may take the direct or the indirect 
approach. In the first, the examinee is asked for specific informa-
tion; in the second, he does not know (though he may guess or
suspect) the import of the questions. For example, the examinee 
may be asked if he is afraid of high places (direct) or he may be 
asked (among other questions) whether he would rather be a 
bookkeeper or an airline pilot (indirect). In the indirect form
of the question, the assumption made is that an examinee is less
likely to fake or rationalize his answers when he is not sure what
motive or what personality trait the inventory is trying to
uncover.


The Personality Inventory

Personality inventories were used by the armed forces in 
World Wars I and II to screen out the maladjusted and those 
likely to become mentally ill. These personal data (or PD) 
sheets consisted of lists of symptoms reported by men who sub-
sequently had suffered from “nervous breakdown” or were classi-
fied as psychoneurotic. Adult questionnaires have been revised
by deleting the more serious and disturbing items so that they 
could be used in the schools. Questions which were removed 
deal with the more reprehensible forms of adult behavior such 
as those involving liquor and sex offenses. In the schools, the 
questionnaire is used to locate pupils with potentially handi-
capping personality problems. The acceptability of a PD Sheet
to pupils, parents, and the community is necessary if the inven-
tory is to be used generally as a group test. A teacher will be well
advised to make sure that the inventory he purposes to use has
the approval of the school authorities. It is important that the
reading level demanded by an inventory be carefully scrutinized,
since many items may not be understood.

The personality inventory is most valuable in the schools for 
counseling and guidance—that is, for spotting pupils with exist- 
ing or potential personality difficulties. When used individually 
and in face-to-face contacts, the PD Sheet is more flexible and
becomes essentially a directed interview. Answers given by the 
student can be pursued further until their meaning is clear. This 
cannot be done, of course, when the inventory is administered 
in group form. Of the personality inventories available (many 
cover the same ground), the following represent acceptable
“tests” for use in the schools:

California Test of Personality 
Pintner's Aspects of Personality 
Gordon's Personal Profile and Personal Inventory 
Bell's Adjustment Inventory
Thurstone Temperament Schedule

Each of these questionnaires will be considered in this section.


California Test of Personality*


Description. This test series runs the gamut from the elementary 
grades to adulthood. Each battery is divided into two sections 
designed to measure (1) personal adjustment and (2) social ad- 
justment. The six sub-tests in section 1 are designed to bring out 
how a student thinks and feels about himself, his feelings of con- 
fidence and adequacy, his tendencies to withdraw within himself 
and to exhibit nervous symptoms. In section 2, the six sub-tests 
question the examinee on his knowledge of social standards, his
social skills, his freedom from anti-social attitudes, and his rela- 
tions to family, friends, and the community. The questions are 
Yes-No in form. Figure 7-2 gives samples from the test. 


Scope. There are five separate test batteries:

Primary Series, kindergarten to grade 3
Elementary Series, grades 4-8
Intermediate Series, grades 7-10
Secondary Series, grades 9-college
Adult Series

Over-all time for administering a test battery is approximately 50
minutes.

* Published by the California Test Bureau, Los Angeles, Calif.


FIGURE 7-2 Sample Items from the California Test of Per-
sonality, Elementary, Grades 4-5-6-7-8, Form AA


PERSONAL ADJUSTMENT (Circle YES or NO)

10. Do your parents or teachers usually need to tell you to
do your work?

23. Do people often think that you cannot do things well?

25. Do you feel that your folks boss you too much?

38. Are you proud of your school?

50. Would you rather stay away from most parties?

68. Do you often feel tired before noon?


SOCIAL ADJUSTMENT

77. Is it necessary to thank those who have helped you?

87. Do you help new pupils to talk to other children?

101. Do people often act so mean that you have to be
nasty to them?

114. Do you like both of your parents about the same?

123. Is it fun to do nice things for some of the other
boys or girls?

139. Do you try to get friends to obey the law?


Reproduced by permission of the California Test Bureau.


Scoring. Answers can be recorded in the test booklet itself or 
on a prepared answer sheet. Scoring is objective and easy. A 
profile of the different scores and over-all adjustment score can 
be constructed. The pupil's earned score (point score) is entered 
opposite the personality component and the percentile rank cor- 
responding to this score is found in the appropriate table. Per- 
centile ranks for total personal adjustment and for total social 
adjustment may also be entered. 

The reliability of the five batteries is quite high (.80-.94). 
Percentile norms are provided for each sub-test and for the 
battery as a whole. This inventory is a useful indication of a
pupil’s all-around adjustment. Diagnosis from the sub-tests is sug- 
gestive rather than conclusive; but many valuable clues which 
serve to explain a child’s behavior may be obtained. 
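
Finding a percentile rank amounts to asking what percent of the norm group scored below the pupil. In practice the rank is read from the publisher's table; the sketch below only illustrates the idea. The norm-group scores are invented and are not taken from the California Test manual, and the convention used (counting half of those who tie the pupil's score) is one common definition, assumed here for the example.

```python
def percentile_rank(score, norm_scores):
    """Percent of the norm group falling below `score`, counting half
    of those who earned exactly that score."""
    below = sum(1 for s in norm_scores if s < score)
    equal = sum(1 for s in norm_scores if s == score)
    return 100.0 * (below + 0.5 * equal) / len(norm_scores)

# Invented point scores on one component for a norm group of ten pupils.
norm_group = [4, 6, 7, 7, 8, 9, 9, 9, 11, 12]

pr = percentile_rank(9, norm_group)
print(pr)  # 65.0
```

A pupil with a point score of 9 would thus be entered on the profile at about the 65th percentile for this hypothetical group.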


Aspects of Personality (Pintner)* 


Description. This inventory consists of three parts: ascendance-
submission, extroversion-introversion, and emotionality. It was
designed to aid the classroom teacher in locating children who 
have developed—or are likely to develop—serious behavior prob- 
lems. Samples of the kind of items found in the test are as follows: 


I have a lot of nerve                                   Same-Different
I like to read before the class                         Same-Different
I feel tired most of the time                           Same-Different
When a child tries to push into line ahead of me, I
am not afraid to tell him to get back                   Same-Different


The pupil indicates his agreement or disagreement with a state-
ment by marking or circling “same” or “different.”


* Published by the World Book Company, Yonkers, N. Y.


Scope. Aspects of Personality is intended for use in the elemen-
tary grades and junior high school.


Scoring. This inventory is readily scored by means of a stencil. 
There are separate norms for boys and girls, and for two levels 
of maturity. A low score on the ascendance-submission part in- 
dicates a shy child, a high score an aggressive one. High scores on 
extroversion-introversion suggest good adjustment; low scores 
suggest withdrawing tendencies and daydreaming. Low scores 
on emotional stability suggest flightiness and lack of control. The 
total score is a rough index of personal adjustment, and probably 
only wide deviations should be investigated. The test often pro- 
vides useful clues. 


Personal Profile and Personal Inventory (Gordon)* 


Description. The Personal Profile is designed to measure four 
fairly distinct personality traits: (a) ascendancy, (b) responsi-
bility (perseverance or reliability), (c) emotional stability, and
(d) sociability. The examinee is asked to indicate which of four 
statements (there are eighteen sets) is most descriptive of himself 
and which is least descriptive. A specimen set is

Able to make important decisions without help
Does not mix easily with new people
Inclined to be tense or high strung
Sees a job through despite difficulties

Each of these phrases is descriptive—positively or negatively— 
of one of the four traits included in the inventory. 

This personality questionnaire uses what has been called the 
“forced-choice” technique—that is, the examinee is instructed to 
choose between statements two of which appear to be equally 
acceptable and two equally unacceptable. (See the description of 
the indirect questionnaire on page 164.) This method of pre-
senting items has certain advantages. If the two choices are fairly
well equated for social value, it is hard for the examinee to fake
his answer, since he does not know clearly what is behind either
choice. Again, the use of forced-choices reduces hesitation and 
indecision, since the examinee is required to make a decision, 
rather than simply choosing between Yes and No. If the examinee 
likes none of the choices, he may select the least objectionable. 
In addition to the four trait scores, the Personal Profile yields a 
total score, which may be represented graphically along with 
the four part scores. Very low total scores have been found to 
be associated with maladjustment and poorly developed per-
sonality.

The Personal Inventory also covers four traits: caution, original 
thinking, personal relations, and vigor. The total score depicts 
the student's personal development in these areas. 


Scope. The Profile and Inventory are designed for high schools, 
colleges, and adults. 


Scoring. Percentile norms are available for each scale, for boys 
and girls separately, for high school, and for college. The four 
scores and the total may be represented graphically on a profile.
These questionnaires have considerable validity, as is shown in
follow-up studies. Together, the two inventories are useful in
counseling students and in screening out those with potential 
behavior problems. Reliabilities of the sub-tests and of the total 
are satisfactory. 


Adjustment Inventory (Bell)*


Description. This well-known inventory consists of questions 
to be answered Yes, No, or ?. It has been designed to estimate 
personal adjustment in four areas: home (satisfactions and dis-
satisfactions), health (illness and general well-being), social
relations (shyness, aggressiveness, and so on), and emotional be- 
havior (self-confidence, depression, and the like). Samples of 
the kinds of items in the questionnaire are


Are you troubled with shyness?     Yes  No  ?
Do you daydream frequently?        Yes  No  ?
Are you often low in spirits?      Yes  No  ?


* Published by the Stanford University Press, Stanford, Calif.


The Bell inventory has proved useful chiefly in locating students
who need counseling. It provides valuable leads to social and 
personal maladjustment. 


Scope. The student form of the Bell is for high school and
college students. There is a form for adults which may be used in
vocational counseling. 


Scoring. The inventory is not timed, but ordinarily requires 
about twenty-five to thirty minutes. The over-all reliability is
high. There are percentile norms for high-school and college 
students and for men and women. 


Thurstone Temperament Schedule* 


Description. This inventory consists of a set of 140 questions, 
twenty items being grouped under each of seven aspects of 
temperament or emotional expression. Adjectives describing the 
seven temperamental traits are: active (degree of energy), 
vigorous (participation in physical activities, sports), impulsive
(happy-go-lucky), dominant (aggressive, forthright), stable 
(emotionally), sociable, and reflective (thoughtful, meditative). 
These seven behavior areas, which were identified through a 
study of the intercorrelations of many personality variables, are 
believed to constitute certain basic aspects of social behavior. The
inventory is well adapted for use with normal people: items
obviously bearing upon mental disease have been avoided.


Scope. For high schools, colleges, and adults. 


Scoring. Percentile norms are available for the sub-sections,
and the seven scores may be plotted on a profile for a study of
idiosyncrasies. The reliability of the whole inventory is high.
However, the reliabilities of the sub-sections are not high. Hence,
although the inventory may provide valuable clues to a counselor,
diagnosis based on part scores should be tentative. 


ATTITUDE SCALES 


When we know that a man is a Socialist or a Christian Scientist, 
we feel fairly sure that we can predict his answers to questions 
dealing with politics or religion. An attitude is a consistent point
of view, a way of behaving toward an institution, a social group, 
or toward personal, political, or religious issues or practices. 
Attitudes may be fairly narrow or quite broad, and they may be
strongly or weakly held. In general, an attitude pivots around 
strong likes or dislikes. A person's attitude toward drinking, 
professional sports, popular music, or "eggheads," for example, 
will be exhibited in expressions of opinion which are often 
emotional.

Scales for measuring the spread and strength of attitudes have 
often been used by social psychologists but are rarely employed 
routinely in the schools. One of the most comprehensive lists of 
attitude scales (about thirty in all) has been constructed by 
L. L. Thurstone and his associates at the University of Chicago.
These scales estimate the strength of one's attitude (on either
the favorable or unfavorable side) toward such diverse matters 
as war, capital punishment, the church, and communism. 

In this section we shall describe two scales both of which 
have been useful in high school and college. These are 

Ascendance-Submission Reaction Study (Allport) 
Study of Values (Allport-Vernon-Lindzey) 


Ascendance-Submission Reaction Study* 


Description. This questionnaire attempts to determine whether
a person characteristically dominates or is dominated in the face-
to-face contacts of everyday life. The A-S Reaction Study is
usually classified as a personality inventory, but it can just as
well (perhaps better) be described as an attitude scale, since it
tries to discover an individual's habitual way of behaving in
everyday social contacts. There are two forms of the test, one
for men and one for women. Each item presents a situation which
might readily be encountered in school, on the street, or in a
store or bus. From two to four possible responses are offered.
The examinee selects the option which most nearly represents
what he would ordinarily do. Choices range from aggressive to
submissive and are weighted in such a way as to differentiate be-
tween the two attitudes. Scoring weights for the separate items
were determined experimentally, and the total score shows the
strength of the examinee’s typical behavior on a dominance-
submission scale.


Scope. The A-S inventory is designed for use in high schools, 
in colleges, and with adults.


Scoring. Scoring is by stencil, the answers being weighted + 
(plus) for dominance and − (minus) for submissiveness. Per-
centile norms are provided for high-school and for college stu-
dents, and for adult men and adult women. The A-S Study is
often useful in educational and vocational guidance. In many 
occupations, such as nursing, teaching, library work, and clerical 
jobs, a strongly dominant attitude is a liability rather than an 
asset. On the other hand, in positions requiring leadership, 
dominant behavior and self-confidence are crucial when decisions 
are to be made. The A-S Study is especially valuable when com-
bined with aptitude and other tests. The test has satisfactory
reliability. 


Study of Values (Allport-Vernon-Lindzey)* 


Description. This questionnaire sets out to gauge the strength 
of six basic attitudes, described as follows: theoretical (marked 
by dominant interest in the discovery of truth, the rational 
approach to life); economic (interests lie in practical affairs);
aesthetic (places greatest value on form and beauty); social (chief
interest in people); political (primary interest in power, influence,
and renown); and religious (committed to mystical values, seeks
to comprehend the universe). The Study assumes that a person’s
philosophy of life is revealed by the strength of his basic
attitudes. See Figure 7-3.


Scope. College students and adults.


Scoring. To score, one simply adds the weights assigned the 
various items. The total score for each of the six Values can be 
plotted on a profile to show graphically the relative strengths of 
the individual's attitudes. Norms are for college students, for 
men and women separately, and for some occupational groups. 
The Values inventory has shown expected differences between 
medical and theological students and characteristic differences
among other occupational groups. The inventory is useful in 
counseling and in personnel selection. It is also valuable in fore- 
casting the direction of a student's attitudes. 


FIGURE 7-3 Specimen Items from a Study of Values 


(Answers are indicated by checking or marking.)


Theoretical v. Economic: 

1. The main object of scientific research should be the discovery of truth
rather than its practical applications. (a) Yes; (b) No. 

Religious v. Social: 

9. Which of these character traits do you consider the more desirable: 
(a) high ideals and reverence; (b) unselfishness and sympathy? 


Aesthetic v. theoretical v. political v. economic: 

10. Which of the following would you prefer to do during part of your 
next summer vacation (if your ability and other conditions permit)— 
a. write and publish an original biological essay or article
b. stay in some secluded part of the country where you
can appreciate fine scenery
c. enter a local tennis or other athletic tournament
d. get experience in some new line of business


From Allport-Vernon-Lindzey, “Study of Values." Reproduced by permission of Houghton 
Mifflin Company.


INTEREST INVENTORIES


The interest inventory is essentially a self-report or survey
covering a person's own interests, values, preferences, and feel-
ings over a wide range of activities. The importance of interests
early became apparent in industry when studies of worker
efficiency made it clear that job success depends as much on moti-
vation as on aptitude and training. In the school, knowledge of a
student's dominant interests is of real significance for the coun- 
selor or teacher. The printed interest inventory supplies sys- 
tematic information about a student's attitudes, feelings, and 
personality trends which otherwise could be revealed only in a
long interview, if at all. Students often report patterns of interest 
which differ widely from their stated educational and vocational 
goals. The astute counselor may be able from study of the in- 
ventory to suggest occupational areas which the student hitherto 
had not even thought of.

The interest inventory is generally more acceptable in the 
school than is the attitude scale or the personal data (adjustment) 
blank. Many students resent the personal questions of the adjust- 
ment blank and sometimes find them emotionally disturbing. For 
this reason, they conceal their true feelings, especially if they are
not conventional or socially acceptable. An interest inventory is 
not likely to be faked or responded to adversely. Examinees find 
it impersonal, less prying, and often interesting in itself. Hence 
their appraisals are usually honest. From an interest inventory, 
a counselor gets a clearer idea of a student’s occupational aspira- 
tions. He may get, too, valuable clues as to a student’s personality 
trends—for example, his desire for security rather than for ad- 
venture, for active rather than passive roles, for people rather 
than books. 

The best-known interest inventories are vocational and were 
intended for adults. As a result, they are not very useful below
the eighth grade. This is not a serious disadvantage, however, 
since the interests of elementary children are often uncrystallized
and may be superficial, unreliable, and unrealistic. The moving
pictures, TV, and romantic stories invest certain activities (that
of the actor, the game hunter, space adventurer, and detective,
to cite a few) with artificial glamor. Moreover, a pupil's informa-
tion concerning many occupations—time required, aptitudes
needed, financial returns to be expected—is often meager and 
distorted. Even in high school and college, when choice of voca- 
tion becomes crucial, information on many occupations is not 
available. A number of pamphlets describing the requirements 
for various occupations will be helpful to counselors and students 
(see page 253).

This section will describe three interest inventories, one suit-
able for pupils whose reading level is up to sixth-grade standards,
and two for high-school and college students and for adults. 


Occupational Interest Inventory 
Kuder Preference Record 
Strong Vocational Blank


Occupational Interest Inventory* 


Description. This inventory provides scores in three interest 
areas. First, there are scores in six basic fields of occupational
interest: personal-social (personal contacts, service fields);
natural (outdoor activities, farming); mechanical (machinery
design, building, constructing things); business (activities of the 
business world, the “profit motive”); arts (music, literature, 
drama); sciences (chemistry, engineering, biology). Second, cer- 
tain items are designated verbal, manipulative, or computational, 
and scores in these areas provide information as to the direction 
of one’s interests. Finally, the attempt is made to gauge the level 
of an examinee’s interests—whether his interests identify him 
with simple routine aspects of a job or with the more expert 
performances and skills.

The six basic fields of occupational interest show considerable 
overlap, and their identity as strictly separate compartments of
interest is doubtful. At the same time, the Occupational In-
ventory does give the counselor a better notion of the intensity
and range of a person's dominant interests than could be obtained
from casual conversation, and it furnishes many clues to type and
level of commitment. Vocational or other advice based on in-
terest scores should always be tentative, however, and subject to
confirmation from other sources. The items of the Inventory
are forced-choice in form. The Manual gives many suggestions as
to the ways in which results may be interpreted. A few samples
from the intermediate form of the Inventory will show the kinds
of question asked. The large letter before a question gives the
interest field, the small letter the interest level, and the symbol the
interest type (whether verbal, manipulative or computational).


Part I


Directions: Draw a circle around the letter of the activity you
prefer. For example, if you prefer to drive an ice-
cream truck and sell ice cream, draw a circle around
A 1 as shown below:


(A 1) Drive an ice-cream truck and sell ice cream.

F 1 Wrap articles in the shipping department of a store.

However, if you prefer the second activity, draw a circle around
F 1. A second item, to be marked according to the same direc-
tions, is

A 14 Conduct visitors through art galleries and museums

E 14 Help build automobiles, ships, or airplanes


Part II


Directions: Below you will find three activities under each
number. You are to choose the one you prefer to do
of the three in each group. Indicate your choice by
marking the letter preceding the activity.


f 11 Design or construct stained glass, metal ornaments or
plastic figures
b 11 Make pottery, statues or book ends
d 11 Carve wood or stone or make metal ornamental figures


Scope. The Intermediate Inventory begins with grade 7 but 
may be used with bright sixth graders. The Advanced Inventory 
is for grade 9 and for adults. 


Scoring. Percentile ranks for the six basic interest fields may 
be read from the appropriate tables. Standard scores from con- 
verted raw scores are also provided. Part scores may be repre- 
sented graphically by means of profiles for a clearer comparison 
of interest strength. The working time for the test is thirty to
forty minutes. Scoring is simple: the items designated by letters 
and symbols are counted. 


Kuder Preference Record* 


Description. The Kuder Preference Record (Form B1) is a 
widely used vocational interest blank. There are 360 items in all, 
arranged in groups of three. In each set of three, the examinee is
asked to indicate which activity he would like most and which he
would like least. Response is made by punching a small hole with
a pin, and the answer is recorded on a specially prepared answer
sheet placed under the test blank. The samples below (given as 
examples in the Record) are for illustration: 


Directions: You will find a number of activities listed in groups 
of three on the following pages. Read over the three 
activities in each group. Decide which of the three 
activities you like most. Note the letter in front of
it and punch a hole beneath the 1 beside this letter 
in the column at the right, using the pin with which 
you have been provided. Then decide which activ-
ity you like least and punch a hole beneath the 3
beside the corresponding letter in the column at the 
right. 

* Published by Science Research Associates, Chicago, Ill. There are 3 forms,
2 vocational and 1 personal.




Example #1
                                         1   3
P. Visit an art gallery                  O P O
                                         1   3
Q. Browse in a library                   O Q ●   < least
                                         1   3
R. Visit a museum              most >    ● R O

(The punch in the hole beneath 1 beside R shows that the ex- 
aminee would most like to visit a museum. The punch in the hole 
beneath 3 beside Q means that he would least like to browse in
a library.)


Example #2
                                         1   3
S. Collect autographs                    O S O
                                         1   3
T. Collect coins               most >    ● T O
                                         1   3
U. Collect butterflies                   O U ●   < least


(The punch beneath 1 beside T means that the examinee would 
most like to collect coins. The punch beneath 3 beside U means 
that he would least like to collect butterflies.) 


All items are of the forced-choice type in that the examinee 
has to make a selection among limited options. But the stipula- 
tion of two choices (most and least) allows some latitude. The 
test provides for ten interest-areas: outdoor (agriculture, nature), 
mechanical, computational, scientific, persuasive, artistic, literary, 
musical, social service, and clerical. The student or young adult 
may be asked to express interest preferences required in many 
vocations unfamiliar to him. In most cases, this is a help to the 
counselor, who can then describe these vocations to his client. 
The Manual provides valuable information about the interest
patterns typical of persons successful in various lines of work.

The purpose of the Kuder Record is to reveal interest trends
over several broad areas rather than in fairly specific occupations.
There are, for example, fifty-three occupations grouped under 
scientific interests. Items in the Record were first organized on 
a logical basis in the light of everyday experience and common 
sense. Later, items were analyzed statistically in order to isolate 
clusters of items highly correlated. These clusters were taken to 
reveal a core of interest.

The Kuder Record relies for its validity primarily on content 
analysis and logical relations. The number of choices offered 
and their nature sometimes confuse students; and the inability 
to find clear-cut preferences may lead to dissatisfaction with the 
forced-choice aspect of the test. Below the eighth grade the 
reading level is probably too high, and the Record should not 
be used. The fact that the scoring plan does not weight sharply 
strong vs. weak interests has been another criticism of the test. 
At the same time, the Kuder Record is an excellent measure of 
the range of expressed interests, and as such is valuable in edu- 
cational and vocational guidance. It is often possible to point 
out to a student that he has expressed many interests not in line 
with his vocational goals. The median reliability coefficient for 
the nine interest-areas is .91. 


Strong Vocational Interest Blank* 


Description. This was the first vocational interest blank and is 
still the best-known. There are forms for men and women. The 
Blank has gone through several revisions and in its latest form 
comprises four hundred items grouped under eight categories. 
These are occupations (likes and dislikes), school subjects, 
amusements, outdoor and indoor activities, responses to peculiari-
ties of people, choice of activities, comparison of interests, and
evaluation of personal abilities. The examinee indicates his choices 
by circling or marking. Answers to the items are given numerical 
weights, obtained by comparing the replies of a defined occu-
pational group (lawyers, for example) with the replies of people
in general. In all, forty-five occupations or areas of interest are
covered by the Blank.

* Published by the Stanford University Press, Stanford, Calif.


Scope. For men and women. 


Scoring. A person's score on a given scale (his interest in teach-
ing, for example) is found by totaling the plus (+) and minus
(-) credits obtained from the options he has marked. A separate
key is used for each vocation. Thus an examinee’s blank may be
scored for the interests of an engineer, a physician, and a sales
manager. Point scores are converted into standard scores to
afford a direct comparison. A useful scale of letter grades is
also available: A represents close identification of interests with
the given vocation; B+, B, and B- somewhat lesser agreement;
and C+ and C a very different interest pattern from that of the
occupation under study. For example, a college student may have
A and B scores in the interests of a minister and social worker, and
C+ and C scores in the interests of a mathematician or physicist.
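
The scoring procedure just described (totaling plus and minus credits under a separate key for each vocation) can be sketched in a few lines of Python. This is only an illustration: the item names, options, and weights below are hypothetical and are not taken from the published Blank or its keys.

```python
# Hypothetical scoring keys: the items, options, and weights below are
# invented for illustration and are not taken from the published Blank.
SCORING_KEYS = {
    "engineer": {
        ("item_1", "like"): 3,
        ("item_1", "dislike"): -2,
        ("item_2", "like"): -1,
        ("item_2", "dislike"): 2,
    },
    "sales_manager": {
        ("item_1", "like"): -2,
        ("item_1", "dislike"): 1,
        ("item_2", "like"): 4,
        ("item_2", "dislike"): -3,
    },
}


def scale_score(responses, key):
    """Total the plus and minus credits for the options the examinee marked."""
    return sum(key.get((item, option), 0) for item, option in responses.items())


# One examinee's marked options (hypothetical).
responses = {"item_1": "like", "item_2": "dislike"}

# The same blank is scored once for each occupational key.
point_scores = {
    occupation: scale_score(responses, key)
    for occupation, key in SCORING_KEYS.items()
}
print(point_scores)
```

The point scores obtained in this way would still have to be converted to standard scores or letter grades from the publisher's tables, which the sketch does not attempt.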

Somewhat less time-consuming, and often more valuable than
specific vocational scores, are the scales for interest clusters. There
are eleven of these clusters, for example, personal-social ("uplift,"
interest in social science, interest in becoming a teacher, preacher,
or school superintendent); mathematics-science (chemist, engi-
neer); and business-commercial (salesmen, various business inter-
ests, making money). The use of clusters or occupational families
provides greater flexibility in the use of the Strong Blank.

The Vocational Blank is in reality a systematic interview. By
its aid, the counselor becomes better acquainted with the strengths
and directions of a student’s interests. It may happen that a
young man’s interests differ sharply from the vocational plans
which his parents have for him. The counselor must then point
out discrepancies and try to resolve them to the satisfaction of
all concerned.

SUMMARY ON PERSONALITY INVENTORIES 


Validity. Insofar as an inventory includes questions which
experts agree are relevant to the area being tested, the question-
naire has content validity. The adjustment inventories (PD
Sheets) are made up of items drawn from texts on abnormal psy-
chology and cover conditions which have been found to be
symptomatic of mental illness. The interest inventories have been
validated experimentally against a number of criteria: expressed
interests of successful professional and businessmen, successful
completion of training courses, ratings for work success, staying
in an occupation vs. leaving it, and degree of job satisfaction.
Correlational analysis has been used to locate clusters of items
which embrace a common core of interest or a community of
interest patterns. Follow-up studies of the Strong Vocational
Interest Blank show that men tend to stay in occupations for
which they expressed strong interests as students and to change
from occupations for which their expressed interests were weak.

In using interest inventories, several precautions should be 
taken. It is well to remember that interest and aptitude are not 
the same thing, and that many youngsters express interest in
vocations for which they have little capacity. Again, the interests
of young people, especially those below the age of 25, often
change markedly. Adolescents may express unrealistic interests 
which change drastically later on. More than one determination 
of interests, therefore, should be made. Finally, it must be 
remembered that advice about occupational families is much 
safer than advice about specific jobs. Any inventory, personality 
or interest, should be supplemented by school and intelligence 
records, ratings for health, appearance, motivation and socio- 
economic status. 


Reliability. The reliability coefficients of most inventories are
high: .80 or more. As interests change over a period of time,
reliability determinations can be relied on for short periods only.



Scaling. Inventories are usually scored by assigning weights to 
the various options presented. These points are converted into 
percentile norms, standard scores, and sometimes letter grades. 
Norms for the adjustment inventories are most often for stu-
dents, less often for occupational groups. The interest inventories
report norms for occupational families (Kuder) and for specific
occupations (Strong). Test Manuals provide many useful sug-
gestions for the interpretation of the inventories.
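
As a rough illustration of how a raw point score might be turned into a percentile norm or a standard score, the following Python sketch uses a small, made-up norm group. It assumes that percentile rank is taken as the per cent of the norm group scoring below the given score, and that the standard score is the deviation from the group mean in standard-deviation units; published Manuals may use somewhat different conventions and much larger norm groups.

```python
from statistics import mean, pstdev


def percentile_rank(score, norm_scores):
    """Per cent of the norm group scoring below the given score (assumed definition)."""
    below = sum(1 for s in norm_scores if s < score)
    return 100 * below / len(norm_scores)


def standard_score(score, norm_scores):
    """Deviation from the norm-group mean in standard-deviation units (a z-score)."""
    return (score - mean(norm_scores)) / pstdev(norm_scores)


# Hypothetical raw scores for a norm group of ten examinees.
norm_group = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

print(percentile_rank(18, norm_group))  # four of the ten scores fall below 18
print(standard_score(18, norm_group))   # slightly below the group mean of 19
```

Letter grades, where a Manual provides them, would be read from the publisher's own conversion table rather than computed this way.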


SUGGESTIONS FOR FURTHER READING 


Anastasi, A. Psychological Testing. New York: Macmillan, 1954.
Freeman, F. S. Theory and Practice of Psychological Testing (Rev.
edition). New York: Holt, 1955.
Jordan, A. M. Measurement in Education. New York: McGraw-Hill.
Travers, R. M. Educational Measurement. New York: Macmillan, 1955.


SUGGESTIONS FOR LABORATORY WORK 


1. One of the best ways to become acquainted with a personality inven-
tory is to take it yourself. Members of the class should take such of
the questionnaires as are available, score them, and draw profiles
where called for.

2. Examine the Manual for the Kuder Preference Record.
What is said about validity, reliability, [...]?

3. Study the Manual for the Allport A-S Reaction Study and the
Manual for the Bell Adjustment Inventory. How are the [...]
of these inventories [...]?


QUESTIONS FOR DISCUSSION 


1. In an adjustment inventory, the number of positive symptoms be-
comes the score. What is meant by saying that Stanley is on the median
for a personal data sheet?

2. Which interest blank, the Kuder or the Strong, is more appropriate
for high-school students?

3. How might an interest inventory be used in studying a child's per-
sonality trends?

4. How closely related are interests and aptitude? Does the relation-
ship change with age?

5. Why are personal data blanks of little value when administered as
"group tests"?

6. Under what circumstances do you think the interest inventory
would be most helpful? At what age levels? Give reasons for your
answers.

7. What factors limit the usefulness of paper-and-pencil adjustment
inventories?

8. A high-school senior expresses a strong interest in engineering, but
his interest inventory score does not confirm this interest. What would
you as counselor suggest to him?

9. The Strong Vocational Interest Blank has a key for various specific
occupational interests (dentist, banker, carpenter, for example). What
difficulties do you see in such restricted interest patterns?

10. Why is a personal data sheet easier to fake than an interest blank,
even when the items are not forced-choice?


CHAPTER 8 


OBJECTIVE-TEST ITEMS AND 
SHORT-ANSWER TECHNIQUES 


There are at least two reasons why the teacher interested in 
guidance should be familiar with the main types of objective test 
items. In the first place, the most widely used standard group 
tests are made up of items of the objective sort. (See Chapters 
4 and 5.) Hence, a knowledge of the strengths and weaknesses 
of objective questions will enable a teacher to make a more dis- 
criminating choice among several tests of intelligence, educa- 
tional achievement, or aptitude proposed for use in a school. 
Second, a teacher-made test is greatly improved when the teacher 
knows the principles which govern the writing of objective type 
items and the assembling of them into a test.

This chapter will describe some of the better-known (and
more widely used) verbal objective type items. These include
the true-false, multiple-choice (best-answer), matching, comple-
tion, and short-answer essay questions. The advantages and dis-
advantages of each of these item types are listed and examples
given to illustrate errors to be avoided in writing items of each 
type. Objective tests employ numbers, geometric forms, pictures, 
and diagrams, as well as words. (Figure 8-1 provides a number of 
illustrations.) Some of the varieties of non-verbal items frequently 
encountered in standard tests fall under the following heads: 


1. Number Series Completion.* The examinee is asked to com- 
plete a series of numbers—which are related in some way— 
by the addition of one or more appropriate numbers. 

2. Figure Completion. The examinee must complete a figure by
the addition of a line or other detail.

3. Likenesses and Differences. From a list of pictures showing
objects or activities, the examinee is required to select several
which belong together, or to select an item which does not
belong with the others.

4. Picture Completion. The examinee is to complete a picture 
from which one or more items have been omitted. 

5. Errors in a Drawing. The examinee must locate and correct 
errors in a drawing. 

6. Arranging Pictures. The examinee is to arrange a set of pic- 
tures in orderly fashion so as to tell a story. 


Most non-verbal items are variations on the multiple-choice 
type. Non-verbal items are frequently used in tests designed for 
young children. (See page 119.) 


Comparison of Objective Items and Essay Questions 


The traditional essay question often covers too much ground,
and is open to large errors in scoring and interpretation. Con-
sider the question “Discuss the causes of the War of 1812” as
an example of a common form of essay question. Answers to this
question will almost certainly include material that is true and
relevant, material that is ambiguous, material that is clearly
erroneous, and material that is mostly padding. It becomes well-
nigh impossible for two or more readers to evaluate the answers
to such a question in the same way. However, when choices in
objective test questions are recorded by checking one of several
possible answers, circling a number, or underlining a word or
phrase, the grade on the test will be the same whether the scoring
is done by a clerk or by an expert. And the answer will be right
or wrong.


* This test is also classified as verbal.


FIGURE 8-1 Objective-Test Items Represented by Picture, Drawing, or Diagram

[Figure not reproduced. Its panels show sample non-verbal items with
their directions: true-false statements read aloud by the examiner about
sets of pictures (the sample statement is in French); a multiple-choice
question about a diagram of coal, wood, and paper placed near an ash pit;
best-answer items on the identification and use of pictured tools; a
TRUE / FALSE / CANNOT TELL item on arcs of two equal circles; figures to
be made from a given pattern; a picture-classification item; and an item
directing the child to mark two things good to eat.]

Examinations composed of objective items possess several other 
advantages over questions of the essay type. The objective item 
not only eliminates unreliability due to personal opinion but is 
the more easily scored, is economical of time, and allows for a 
wider sampling of material. Furthermore, the objective test item 
forces the student to answer a question directly, gives him little 
opportunity to equivocate or dodge, and is, for that reason, a 
more dependable measure of what a student knows. On the nega- 
tive side, the objective item may provide little opportunity for 
the examinee to display his understanding and organizing ability. 
When poorly made, the objective item may lay too much stress 
on rote memory and unrelated bits of information. 


Defining the Purpose of the Test Item 
It is necessary to keep constantly in mind the purpose we 
intend our test items to serve. Items may then be selected—or, 
in a standard test, examined—with these objectives in view. We 
cannot always be sure, it is true, of exactly what a given item
is measuring. But we can sharpen our aim by setting up definite
specifications (page 211) which we want our items to meet. For
example, an item should:

1. Elicit information (often fairly specific) which reveals an
understanding of a process, principle, situation, or historical
movement.

2. Require the examinee to demonstrate knowledge and use of
technical terms and concepts. 

3. Give the examinee a chance to show his ability to apply a 
principle in the solution of a problem, draw a conclusion, 
arrive at a generalization. 

4. Call forth responses which will reveal the examinee's atti-
tudes, interests and personality traits.


Not every item, of course, can be fitted neatly into one of 
these categories. Some (many, we hope) will cut across several. 
Nevertheless, each item should be written to achieve a definite 
purpose, to call out some important bit of knowledge, under-
standing or application.


Assembling Test Items 


In the process of making an objective test, the type of item 
to be used must be decided upon and the items written, before 
we are ready to assemble them into tentative form and try out 
the test. Several problems arise: determining the difficulty of 
the items and their discriminative power, drawing up directions, 
and preparing a key and scoring sheets. Methods for carrying 
out these procedures will be treated in Chapter 9. 


TRUE-FALSE ITEMS 


The true-false test presents a series of statements or questions 
each of which is to be marked “T” (true) or “F” (false). Instead
of circling one of the letters “T” or “F,” the examinee may be
asked to circle “Yes” or “No,” or to write + (plus) or — (minus), 
or in some other way to designate a positive or negative answer. 
One of the earliest objective forms, the T-F test is still widely 
used in group intelligence as well as in educational achievement 
and aptitude tests. It has been criticized as being a measure of 
rote memory, a test of detached and unrelated facts, and as often 
being ambiguous and equivocal. Such strictures are justified when
the test is poorly or carelessly made. There is a large element of
guessing in T-F tests, too, and good items are not easy to con- 
struct, however simple the process may seem to be. But when 
well made, T-F items have valuable possibilities arising from 
their scope and flexibility. The chief advantages and disadvan- 
tages of the T-F item may be summarized as follows: 


Advantages: 


1. It may be used with a wide variety of materials. 

2. It may be scored easily and objectively. 

3. It is the easiest objective type to construct. 

4. It makes possible an extensive sampling of material in a rela-
tively short space.

5. It is a time-saver, thus allowing for frequent testing.

6. The directions are readily understood and followed. 


Disadvantages: 


1. It is often ambiguous and confusing.

2. It is open to guessing and to chance effects.

3. Much subject matter cannot be stated as unequivocally true
or false.

4. It may readily become a test of detached and unrelated bits of
information.

5. It may overstress rote memory at the expense of under-
standing.


Some of the rules useful in constructing teacher-made tests are 
given below. In judging the adequacy of printed T-F items, it 
will help to note whether these rules have been observed. 


1. Putting the symbols “T” and “F” before each question is
preferable to having the examinee write the letters at the end of 
a statement, thus scattering his answers over the page. Circling 
or marking saves time in scoring test papers and leads to fewer 
errors, since the letters written by an examinee are not always 
legible. See examples. 


On the test paper:                 On the answer sheet:

(T)  F   1. ........               1.  (T)  F

 T  (F)  2. ........               2.   T  (F)

(Parentheses mark the circled letter.)


2. Make the number of true statements equal to the number
of false statements. The scoring formula for T-F items is


Score = Right - Wrong
or Score = Total - 2 × Wrong


Either of these formulas corrects for guessing, and both give
the same result provided the pupil has tried all of the items.
Suppose for example that there are sixty items in the test, and a
pupil gets forty right and twenty wrong. Then his score is 40 - 20
or 20, or 60 - 2 × 20 or 20. If the child does not try all of the
items, the two versions of the formula will not give the same
result and the first (R - W) should be used.

If an examinee guesses at every item, he should have one-half
of the items right and one-half wrong, and his score (R - W) is
properly zero. If an examinee attempts only thirty out of forty
items in a given examination, his score may be corrected to a
total of 40 by adding one-half of the untried items, that is, half
of 10, to his number right. (Presumably he would get one-half
of the untried items right by guessing.) It is not necessary to
correct every paper to the number of items in the test. But test
scores for a class are the more fairly compared when all are
based upon the total number of items in the test.
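
The arithmetic of the two T-F formulas, and of correcting the number right to the total, can be checked with a short Python sketch. The function names are ours, chosen only for illustration, and the second pupil's figures (22 right and 8 wrong out of 30 items tried) are hypothetical; the first case repeats the sixty-item example given in the text.

```python
def right_minus_wrong(right, wrong):
    """Score = Right - Wrong."""
    return right - wrong


def total_minus_twice_wrong(total, wrong):
    """Score = Total - 2 x Wrong (agrees with R - W only if every item was tried)."""
    return total - 2 * wrong


def number_right_corrected_to_total(right, untried):
    """Add one-half of the untried items to the number right."""
    return right + untried / 2


# The example from the text: sixty items, forty right, twenty wrong.
all_tried_rw = right_minus_wrong(40, 20)
all_tried_total = total_minus_twice_wrong(60, 20)

# A hypothetical pupil who tries only 30 of 40 items: 22 right, 8 wrong.
partial_rw = right_minus_wrong(22, 8)
partial_total = total_minus_twice_wrong(40, 8)
corrected_right = number_right_corrected_to_total(22, untried=10)

print(all_tried_rw, all_tried_total)   # the two formulas agree
print(partial_rw, partial_total)       # the two formulas disagree
print(corrected_right)                 # number right corrected to 40 items
```

In the second case the two formulas give different results, which is why the first (R - W) is the one to use when items have been omitted.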

The correlation between number right and (R - W) is per-
fect when all of the items of the test have been tried. Hence,
when a child’s score has been corrected to the total, number
right may be taken as the score instead of (R - W). The ques-
tion of whether to tell an examinee to guess has excited much
controversy, partly because of the opprobrium attached to the
term guessing as related to school examinations. A good general
rule is to instruct the student to omit only those items which he
is sure he doesn’t know, to try an item even when not entirely
certain of the answer, but never to guess wildly. Since the exam-
inee has been exposed, at least, to the subject matter of the test,
the chances are better than even that his answer will be based
on some information, even if it is vague and uncertain. Hence,
a T-F answer is more likely to be right than wrong.


3. Avoid opinionated and trivial (or trick) items. 


Examples: T F Character is more important than intelligence.
          T F The ABC Test of Mental Maturity contains 75
              items arranged into 6 sub-tests.
          T F William Collins Bryant is the author of
              Thanatopsis.
          T F One-half of a perfect correlation is .50.


The first of these items calls for a value judgment, which may
be true or false; the second and third ask for trivial information;
and the fourth is a trick question which happens to be false.


4. Avoid ambiguous statements, those partly true and partly
false, and those containing negatives, especially double negatives.

Examples: T F Socio-economic factors are often the cause of
              war.
          T F William Jennings Bryan, the great Commoner,
              was twice elected president of the U.S.
          T F Not every teacher is careful to avoid having a
              student dislike his subject.
          T F Not all instincts are maladaptive.


The first item is ambiguous; the second is partly true and partly
false; and the third and fourth are confusing because of the negative
form in which they are stated. Double negatives are especially
hard to decipher.


5. Avoid textbook language and verbatim quotations. Such 
items encourage rote memory and are often ambiguous when 
taken out of context. 


Examples: T F The role of the teacher is to help the pupil es-
              tablish satisfying goals.
          T F Heredity determines what a man can do, en-
              vironment what he does do.


Textbook verbiage aids in making a correct guess. 


6. Avoid specific determiners, such as all, none, always, never,
every. Broad generalizations introduced by these words are
usually false.

Examples: T F Feeblemindedness is always present in delin-
              quency.
          T F Corporal punishment is never justified.
          T F All ministers lead lazy lives.


These items are all too general and all are incorrect.


The T-F item is not so popular among teachers as it was 
formerly, and it is not found so often in standard tests. It is still 
ranked high, however, and is perhaps the quickest way of sur-
veying a wide range of material. When supplemented by other
test forms, T-F is a valuable objective item. 


MULTIPLE-CHOICE OR BEST-ANSWER ITEMS 


The multiple-choice item consists of a statement, question, 
phrase, or word followed by several responses only one of which 
is correct. Multiple-choice is one of the most flexible of the 
objective-recognition-type forms. It is a favorite with teachers 
when making their own examinations, and is most widely em- 
ployed in the standard printed forms. Multiple-choice items can 
be so constructed as to measure information, comprehension, 
understanding of principles, and ability to interpret data. The 
test form is applicable to most subjects and to most materials. 

Some of the strengths and weaknesses of the multiple-choice
item can be summarized as follows:


Advantages:


1. Answers are objective and are rapidly scored.

2. Items may be written to measure inference, discrimination,
and judgment.

3. Guessing is minimized when four or five choices are allowed.

4. Items may be constructed to measure recall as well as recog-
nition.


Disadvantages:


1. Items are often too factual, stressing memory unduly.

2. More than one response may be correct or very nearly correct.

3. It is difficult to exclude clues.

4. Distractors (that is, incorrect but plausible answers) are
often hard to find.


Rules for constructing multiple-choice items for teacher-made 
tests and for judging the adequacy of such items in printed tests 
are as follows: 


1. Vary the position of the correct response: put the right
answer in the first, second, third, and fourth positions equally often.
A scoring formula for multiple-choice items is

Score = Right - Wrong/(n - 1)

in which n = the number of choices, usually four or five. This
formula is used to correct for guessing on the assumption that
each response is equally likely. This conjecture is correct when
the examinee has no idea of the right answer; but in educational
achievement examinations as well as in other tests, it is a ques-
tionable hypothesis. Distractors differ greatly in plausibility and
likelihood; and since the student presumably has some knowledge
of the question, he is more likely to mark the right than a wrong
answer. In most educational achievement tests, taking the num-
ber right as score saves time and is accurate enough for most
purposes. It must be remembered that in a given test the number
of options must be the same for each item if the above correction
formula is to be used.
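
A small Python sketch of the correction formula above may make the arithmetic concrete. The function name and the sample figures (38 right and 12 wrong on a five-choice test) are hypothetical; the sketch assumes, as the formula does, that every item offers the same number of options.

```python
def corrected_score(right, wrong, n_choices):
    """Score = Right - Wrong / (n - 1), where n is the number of choices per item."""
    if n_choices < 2:
        raise ValueError("each item needs at least two choices")
    return right - wrong / (n_choices - 1)


# Hypothetical pupil: 38 right and 12 wrong on a five-choice test.
five_choice = corrected_score(38, 12, n_choices=5)

# With two choices the formula reduces to the T-F rule, Right - Wrong.
two_choice = corrected_score(40, 20, n_choices=2)

print(five_choice)  # 38 - 12/4
print(two_choice)   # 40 - 20
```

Omitted items are simply left out of both counts, so they neither add to nor subtract from the corrected score.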


2. Do not include responses which are so unlikely or implausi-
ble or so unrelated to the question as to give the answer away.
Distracting responses should distract, not confuse.


Examples: The function of a flower is to
          ____ give pleasure to mankind
          ____ attract insects
          ____ illustrate the modification of leaves
          ____ produce seed


          The capital of the United States is
          ____ Washington
          ____ Rome
          ____ Tokyo
          ____ London
          ____ Honolulu


          The principal crop of Iowa is
          ____ pineapples
          ____ corn
          ____ oranges
          ____ bananas


In the first example, assuming the fourth choice to be the cor-
rect one, the distractors are all rather silly. In examples two and
three, an examinee would have to be almost totally ignorant of
geography to be taken in by the distractors.


3. Do not provide wrong answers which are plausible enough
to mislead the good student because they are close to the right
answer. The good student is often led astray by knowing a good
deal (but not quite enough) about a question, whereas the poor
student does not know enough to be misled by a plausible but
wrong answer.


Example: What was one of the important immediate results of
         the War of 1812?
         ____ the introduction of a period of intense section-
              alism
         ____ destruction of the U.S. bank
         ____ defeat of the Jeffersonian party
         ____ final collapse of the Federalist party


The fourth response is keyed as the correct one. But 39 per cent
of eight hundred high-school pupils, all of them superior stu-
dents, checked the first option as correct. Apparently the first
answer is plausible to students who know a good deal about the
War of 1812.


4. Do not give away the correct answer by providing clues
such as (a) familiar textbook phrases, (b) having the right
option consistently longer or shorter than the wrong options,
(c) repeating the words of the question, (d) asking questions
to which the answer must be singular or plural, with only the
correct response being in the right number.


Examples: In what major labor group have unions been organized
          on an industrial basis? (Circle one letter.)
          A. Congress of Industrial Organizations
          B. Railway Brotherhoods
          C. American Federation of Labor
          D. Knights of Labor
          E. Workers of the World


          The meaning of the German word Gestalten is (Check
          one)
          ____ a response
          ____ a just-noticeable-difference
          ____ a stimulus
          ____ configurations
          ____ a perception


          A man hears a loud noise and runs to the window.
          This is an example of
          ____ motivation
          ____ memory image
          ____ stimulus-response
          ____ posthypnotic suggestion
          ____ purposive behavior


In the first of these examples, the adjective “industrial” in the
question gives the answer away. In the second, if the student
knows that Gestalten is the plural of the German word Gestalt,
he has the answer as “configurations.” In the third, the textbook
phrase "stimulus-response" is a clear clue.


5. In а multiple-choice vocabulary test, none of the response 
choices should be as difficult as the test word. The difficulty of 
response words can be determined from their frequency in 
Thorndike’s Teachers’ Word Book. Response words should be 
of the same part of speech as the test word, and only one should 
be correct. 


Good example: An irksome task is (a) pleasant, (b) engrossing,
(c) instructive, (d) wearisome

Poor example: Do not despise him means do not (a) hate,
(b) malign, (c) deprecate, (d) desiccate him

In the second example, some of the response words are more
difficult than the test word. This is not true of the first example.


6. Direct questions or statements followed by a series of 
options are usually clearer than questions in which the answers 
are imbedded in the statement. In the latter form, the examinee 
must read through the statement for each option. 


Good example: A 10-year-old receives a percentile rank of 40
              on a test of arithmetic. This means that
              ____ he is above the mean of 10-year-olds on
                   the test.
              ____ he exceeds 60 per cent of 10-year-olds.
              ____ 40 per cent of 10-year-olds did worse
                   than he.
              ____ 61 per cent of 10-year-olds exceeded his
                   score.

Poor example: Percentile rank shows the per cent (a) at or
              above, (b) above, (c) at, (d) below, (e) at or
              below the given score.

The second example is more difficult to decipher than the first.


A test made up of multiple-choice items takes more time to
construct than a test of T-F items. Furthermore, good multiple-
choice items are harder to prepare than T-F items, since it is
often difficult to find acceptable distractors. The advantage of
the T-F item is largely offset, however, by the fact that multiple-
choice items are more searching and demand a more organized
knowledge of the subject-matter. Multiple choice is regarded
by most test experts as the best of the short-answer forms.


MULTIPLE-RESPONSE ITEMS 


The multiple-response item is a variation of the multiple-
choice type of question. Essentially it presents a statement or
topic followed by a number of possible answers, several of which
may be checked as correct. The multiple-response examination
tests several aspects of a subject, and is useful in obtaining infor-
mation from tables and charts. This examination form is often
called a check list. The advantages and disadvantages, as well as
rules for construction given above for multiple-choice, apply to
multiple-response items as well.

Two examples follow:


Example: Under each of the following psychological doctrines,
         viewpoints, or systems, indicate by a cross (x) those
         implications or consequences which are characteristic
         of that doctrine.
         1. English Associationism
            ____ a persisting self
            ____ summation and integration of mental states
            ____ universal categories of reason
            ____ mental faculties
            ____ persisting motor-response systems
         2. Purposive Psychology (McDougall)
            ____ imageless thought
            ____ introspection as the primary method
            ____ S-R units
            ____ motivation in terms of instincts
            ____ doctrine of the unconscious
            ____ conative tendencies

Each of these items can be described by more than one choice.


MATCHING ITEMS 


In the matching test, one list of words, names, phrases, for-
mulas, or statements is to be matched against another list. The
test may consist of (a) a list of names in one column to be
matched against a list of achievements in a second column;
(b) a list of terms to be matched against a list of definitions;
(c) labels to be matched against charts and diagrams; (d) authors
to be matched against books, dates and events.

The matching item possesses the advantages of interest and 
variety as well as ease of scoring. It is, furthermore, somewhat 
easier to construct than the multiple-choice item. Matching has
been frequently used to test the relationship between dates, events 
and various facts. On the negative side, the matching item often 
measures recognition memory rather than understanding, and is 
especially open to clues. Nor do matching items ordinarily test 
ability to organize facts or to apply principles. 

Rules for making up matching test items may be set down as 
follows: 


1. Do not include too many items in the lists: 10 or 12 is the
maximum, 5 or 7 often enough. When lists are long, examinees
must spend too much time hunting through them. Have the
number of items in the column from which selections are to be
made larger than the number in the list to be matched. This
lessens the chances that an examinee will match an item correctly
by a process of elimination.


Example: The following statements are representative of different
         schools of psychology. In the blank spaces before
         the statements, write the number of the psychologist
         for whom the statement is typical.

         (1) Adler       (7) McDougall
         (2) Angell      (8) James Mill
         (3) Calkins     (9) Pavlov
         (4) Freud       (10) Titchener
         (5) Jung        (11) Watson
         (6) Koehler     (12) Woodworth

         ____ Sensory processes have the attribute of clearness,
              just as they have quality and intensity.

         ____ There is evidence for the existence of three
              types of native and unlearned emotional
              reactions: fear, rage and love.

         ____ The inadequacy, the relative futility, of all
              attempts to ignore the purposive, the goal-seeking
              nature of behavior renders behaviorism untenable.

         ____ Any mechanism, except perhaps some of the
              most rudimentary that give the simple reflexes,
              once it is aroused, is capable of furnishing its
              own drive and also lending drive to other
              connecting mechanisms.

         ____ The will-to-power is the great motive in mental
              conflict.

         ____ The superego represents the repressions of
              instinct and dominates the ego.

         ____ Mind is primarily engaged in mediating between
              environment and the needs of the organism.

         ____ Sensations are one of the primary states of
              consciousness; ideas are the other.


2. Select materials from one subject-field only, so that a given 
item in column 1 has several plausible matches in column 2. 
Explain clearly the basis of the matching. 


Example: In column 1 are words which illustrate a number of
         parts of speech; in column 2 is a list of various parts
         of speech. Determine what part of speech a word is and
         then identify it by putting its number before the
         proper item in column 2. For example, “boy” is a
         noun, and if “boy” were numbered 5, a 5 would be
         placed before the word “noun” in column 2. Arrange
         the choices in alphabetical order.

         (1) and        ____ adjective
         (2) cat        ____ adverb
         (3) rapidly    ____ noun
         (4) jump       ____ preposition
         (5) from       ____ verb
         (6) rich
         (7) either

Example: Match the items in column 1 with the appropriate
         items in column 2.

         A. Harvey      ____ contagious disease
         B. stomach     ____ digests food
         C. poison      ____ discovered circulation of the blood
         D. Galen       ____ early Greek physician
         E. lungs       ____ supplies oxygen to the blood
         F. heart
         G. measles

The first example is quite easy. But it should enable a teacher to 
spot grammatical confusions. All of the material is from the field 
of grammar. The second item is poor owing to heterogeneity in 
the list of choices (names and bodily organs). 


3. Arrange names in alphabetical order, dates and numbers in
sequence in order to save the examinee’s time.


Example: Select the inventor from the first list and put his
         number opposite his invention in the second list.

         (1) Colt         ____ Atlantic cable
         (2) Edison       ____ cotton gin
         (3) Field        ____ electric starter
         (4) Franklin     ____ sewing machine
         (5) Howe         ____ steam engine
         (6) Kettering    ____ wireless telegraphy
         (7) Marconi
         (8) Watt
         (9) Whitney


4. Avoid clues: for instance, one singular item in both lists,
the others plural; or one item in the list of a different part of
speech from the others. Watch for irrelevant (but revealing)
associations, such as nationality, which give away the matching.
If the examinee knows that a certain discovery was made by a
Frenchman, for example, he will look for a French name.

The matching item is compact and usually interesting to stu- 
dents. It enables a teacher to cover a wide territory in fairly 
short time. Matching is well suited to rapid surveys of specific 
aspects of a field when persons, events, or definitions are wanted, 
or when these constitute necessary knowledge for further work 
in the subject. 


COMPLETION ITEMS 


In this test form, sentences are presented from which certain 
words or phrases have been omitted. Instructions are to fill in 
the blanks so as to complete the meaning. Completion requires
recall primarily, but it also demands thought and the ability
to perceive over-all relationships. Little opportunity is afforded
for guessing. The chief disadvantage of this test form lies in the 
scoring, which is not entirely objective and is often time-consum- 
ing, and in the fact that too many blanks confuse the examinee 
and make a puzzle out of the test. Completion has been a favorite 
of teachers in their own examination-making, although it is not 
so widely used today as multiple-choice and T-F items. 

Rules for writing completion items and errors to be avoided
in such items are as follows:


1. Do not copy sentences and paragraphs directly from the
textbook, since this puts too much emphasis on rote memory and
parrot-like learning. Rephrase the language of the text, if that is
used.

Example: Human behavior, more than that of any other animal,
is a product of .............


Example: Much learning is by trial and ............. 

The first is a poor item. It is out of the textbook and will be 
known by those who recall the textbook language. The second 
is also a poor item—the pat expression “trial-and-error” gives it 
away.


2. Too many blanks make it impossible for the examinee to
get the meaning, especially if the sentence is short. 


Example: Civilized man ......... ; uncivilized man .........
This item actually appeared on a printed test. It is impossible to
complete it, or else it can be completed in a wide variety of ways, 
most of them not indicative of much knowledge. 


3. Scoring is more objective if words rather than phrases are 
deleted. Blank out key words (those which carry the meaning of
the sentence or paragraph), not unnecessary elements or the
articles a, an, the.


Examples: Democracy is that form of ............ in
          which the ............ exercise the
          ............ power through ............
          elected by .............

          Democracy ......... that ......... of government
          in which ......... of the
          people ......... governing power,
          themselves.


The first form of the item is the better, since the blanks contain 
key words. The second version deletes connecting words which 
do not carry the meaning of the sentence.

Example: ............ established the first laboratory
         for the experimental study of psychology in
         .............

This is a satisfactory item, if we want to know who established
the first laboratory in experimental psychology and when it was
founded.


4. Make the blanks long enough to permit legible answers. 
Have all blanks of standard length to avoid clues as to the length 
of the completing word. 


5. When there are several correct answers, provide for these
in a scoring key. Alternate answers may be weighted for good- 
ness of completion, but a simple right or wrong scoring is ade- 
quate in most cases. A good plan is to allow one point for each 
correct answer, none for an incorrect one. 
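A scoring key of the kind described in rule 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `score_completion`, the blank numbers, and the sample key are hypothetical and do not come from any published test. Each blank lists every completion the teacher has decided to accept, and one point is allowed for each correct answer.

```python
def score_completion(answers, key):
    """Allow one point per blank whose answer matches any accepted alternative."""
    total = 0
    for blank, response in answers.items():
        # Compare without regard to capitalization or stray spaces.
        accepted = {alt.strip().lower() for alt in key.get(blank, [])}
        if response.strip().lower() in accepted:
            total += 1
    return total


# Hypothetical key: blank number -> list of acceptable completions.
key = {
    1: ["Wundt", "Wilhelm Wundt"],
    2: ["1879"],
}

# One examinee's answers: the first blank is right, the second is wrong.
print(score_completion({1: "wundt", 2: "1878"}, key))
```

Weighted scoring of alternate answers would need a credit value stored beside each alternative; the simple right-or-wrong form above is the one the rule recommends for most cases.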


6. Guard against clues by taking care that completions do
not depend upon (a) grammatical form, (b) pat or textbook 
expressions. 


Examples: Johnny wears his space suit, even when he ......... 
to bed. 
A much discussed question is the relative importance 
of heredity and ................... 


In the first item, the first singular verb is a clue to the number of 
the second verb. The second item tests rote memory, and the 
“pat” expression “heredity and environment” gives it away. 


THE ESSAY QUESTION 


The essay question has been a standby of teachers over the 
years. It is widely used in the “literary” subjects, such as history 
and English, and in the natural and social sciences as well. The 
purpose of the essay question is to elicit understanding, organiza- 
tion and interpretation, rather than to test for detached tidbits of 
knowledge. The form of the essay question is important. Questions
beginning with “who,” “what,” “when,” and “where” are
usually to be avoided when they ask simply for a name (for
example, Napoleon), a date (1492), an event (Battle of Hastings)
or a location (New York City). But such questions are valuable
when the information asked for is relevant to the solution of a
problem, the making of an inference, or the interpretation of
some event. Questions beginning with “why,” “how,” “with
what consequences,” or “with what significance” are to be
preferred to simple fact questions. Questions beginning with such
words as “discuss,” “evaluate,” “outline” and “explain” invite
(and usually get) a mass of detail, some not relevant. Such
questions are useful, of course, when we wish to know how well
an advanced student can select, reject and organize. But they 
are hard to score and are virtually useless in a broad survey or 
for the diagnosis of specific blindspots. 


Restricting the Essay Question 

The essay question becomes objective when cast into short- 
answer form and restricted in coverage. Two methods of con- 
trolling the essay question and rendering it more specific may be 
mentioned. 


Recall Questions. Recall items are essay questions reduced to 
the simplest terms. Usually a question is followed by a blank 
space, varying in length. Answers are restricted to short para- 
graphs, the account of some event, an algebraic equation and its 
application and the like. Recall items resemble the completion 
type, but they provide for fairly free answers and are less
restricted.

Examples: (1) Define an invertebrate ................

          (2) (a) Name three scientists who contributed to
              atomic theory, and (b) list the major contribution
              of each.

          (3) List three conditions which must hold true if an
              intelligence test is to yield a constant IQ.


The first item calls for a one-line answer. Items (2) and (3) ask
for specific but basic knowledge. Compare (3) with the essay
question, “Discuss the construction of the Stanford-Binet.”


Problem Situations. A problem is stated, and 2, 3, or 4 specific 
questions are asked, each focused on some important aspect of
the situation. 


Example: A skillful teacher has been characterized as one who
(a) maintains a permissive atmosphere. 
(b) avoids negative discipline. 
(c) conforms to the wishes of parents. 
(d) does not use repetitive drill. 
Write one paragraph defending or attacking each of 
these propositions—that is, four paragraphs in all. 


Example: A recurring problem in child development is that of 
maturation. Cite the evidence bearing on the problem 
from the following points of view: 

(a) neurological 
(b) co-twin control 
(c) parallel groups 




Scoring the Essay Question Objectively 


Perhaps the major weakness of the essay examination lies in 
the unreliability of its scoring. Scoring can be made more objec- 
tive by the use of the following techniques: 


1. When essay examinations are marked anonymously, there 
is usually better agreement between different scorers. 


2. There is less opportunity for preferences, attitudes, and
biases to appear when all papers are read for one question at a
time rather than each paper straight through. Obviously, com-
parisons can be sharper with this method.


3. Before reading a question, the teacher can list the basic facts 
which the question is intended to bring out. Points may then 
be assigned to these aspects of the answer. For example, if the
question deals with a chemical process, the answer list may in-
clude (a) the necessary equations of the process, (b) the chemical
elements needed, (c) a diagram of the apparatus, and (d) any
by-products of the chemical reaction. If the question deals with
English literature, the answer may include (a) the author's chief
contribution, (b) the cultural setting of the time, (c) the influ- 
ence of the author's work. A check list of key points, with credits 
assigned to each, is a useful technique. Thus, from one to three 
points may be assigned to each part-answer. 
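The check list of key points can be sketched as follows. The names `score_essay`, `checklist`, and `awarded` are hypothetical, and the maximum of three credits per point is simply the upper figure mentioned above; a teacher would substitute his own key points and weights.

```python
def score_essay(checklist, awarded):
    """Sum the credits given for each key point, never exceeding its maximum."""
    total = 0
    for point, max_credit in checklist.items():
        credit = awarded.get(point, 0)          # a point not mentioned earns nothing
        total += max(0, min(credit, max_credit))
    return total


# Hypothetical check list for a question on a chemical process.
checklist = {
    "equations": 3,
    "elements": 3,
    "apparatus diagram": 3,
    "by-products": 3,
}

# Credits the reader allowed on one paper (the last is over the maximum).
awarded = {"equations": 3, "elements": 2, "apparatus diagram": 0, "by-products": 5}

print(score_essay(checklist, awarded), "of", sum(checklist.values()))
```

Capping each credit at its maximum keeps one over-generous mark from distorting the total, which is the purpose of fixing the credits before any paper is read.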


4. If the teacher marks the papers for spelling, writing quality, 
and grammatical expression, as well as for content and organiza- 
tion, credits should be allotted to these aspects of the answer. 

The essay question is a valuable examination form when held 
to one or more defined themes, so that it is scorable. Many 
teachers are so impressed by the general use of objective-type 
items in the standard tests that they are inclined to drop the 
essay entirely. This is a mistake. Many courses, especially
advanced courses, in literature and in science employ objective-test
items as a first approach to an examination of the subject. But
the essay question is the best (perhaps the only) way in which
a teacher can determine whether a student can organize his
knowledge and arrange his arguments in logical fashion. Short-
answer forms should be regarded not as substitutes for the essay,
but rather as supplementary to it.


SUGGESTIONS FOR FURTHER READING

York: Prentice-Hall, 1954.

Travers, R. M. How to Make Achievement Tests. New York: Odyssey
Press, 1950.

Wrightstone, J. W., Justman, J., Robbins, I. Evaluation in Modern
Education. New York: American Book, 1956.


Note: Most textbooks on educational psychology and in measurement 
and evaluation contain chapters dealing with objective items. 


QUESTIONS AND PROBLEMS


1. Write five true-false items in some subject field familiar to you.

2. Rewrite the five items in number 1 in multiple-choice form. 

3. Construct five matching items in which great scientists, poets, and 
authors are matched against outstanding contributions to science or 
literature.

4. If possible, put the items in number 2 in completion form.

5. Rewrite the following essay questions to make them more objective 
in answering and in scoring. 

a. Discuss some of the proposals for aiding the gifted child. (Hint: 
break down into specific proposals, such as special classes, accelerated 
promotions, extra assignments, and the like.) 

b. Evaluate three of the modern learning theories. (Hint: This topic 
may be subdivided under descriptive labels—behaviorism, for example—or 
under names of well-known theorists.) 

c. Discuss the causes of the industrial revolution.

6. Point out any errors in the following items: 


1. The Frenchman who developed the first successful intelligence
test was (1) Kuhlmann (2) Terman (3) Binet (4) Wundt

2. An efficient man is one who is (1) strong (2) handsome
(3) angry (4) pusillanimous (5) capable

3. T F Edgar Anderson Poe wrote the poem “The Raven.”


4. T F Lack of emphasis on the three R's is not a serious defect in
modern educational practice.

5. T F Strict application of the Golden Rule will make for better
living.

6. T F The median of a distribution of scores is the midpoint,
which is influenced markedly by very high or very low


7. The technical name for nearsightedness is ..........

to an arbitrarily

9. The best way to spend one's leisure time is to (1) read good
books, (2) look at TV, (3) dig in the garden, (4) play solitaire,
(5) relax in an easy chair.

10. The expansion of the binomial (a + b)² is ....................

                          1. off
11. I borrowed a book     2. off of     my roommate. (Answer).......
                          3. from

12. We get the most calories per pound from
    (1) candy             (4) potatoes
    (2) carbohydrates     (5) proteins
    (3) vitamins

13. When there is a fire drill, the teacher must make sure that her
    ........................................
    ........................................

15. T F The work of Freud has done much to demonstrate that
associative connections once formed are never lost, even
though under conditions of everyday life they cannot usu-
ally be recalled.


CHAPTER 9


CONSTRUCTING THE OBJECTIVE TEST 


The classroom teacher needs to know how objective mental
tests are constructed for much the same reason that he needs
to know what constitutes a good test item (page 184). Standardized
tests are now employed routinely in many schools.
Teachers who give and score tests will be better able to interpret
results and to appraise what an author says about his test when
they understand how the test items were selected and put together.
Even more important, perhaps, the teacher who knows
a few essential procedures will be able to improve greatly the
quality of the day-to-day tests which he makes for his own use.

The construction of a comprehensive battery of educational
achievement tests is not a task appropriate for most teachers and
most schools. Standard printed tests in wide use today are made
by testing bureaus. These agencies have a staff of experts in item
writing and construction techniques, technically trained assistants,
and access to large and representative samples and to laboratory
and scoring equipment. The classroom teacher can hardly
hope to match all this. And fortunately it isn’t necessary, since
his test-making is properly on a much more modest scale.

This chapter will outline the basic techniques in test con-
struction. These methods apply whether the test is designed to
measure intelligence, educational achievement, or aptitude. 




WRITING SPECIFICATIONS FOR THE TEST 


Before he begins to construct an examination, the teacher must
decide what he wants his test to do. This means that he must lay
down specifications for the test (page 105). Usually a teacher
wants to test his students’ knowledge of the fundamentals of the
subject and to see how well they can use this knowledge in
solving problems. Three subject matter tests in different areas
will be described in order to show what specifications the
author had in mind and how he went about accomplishing his
objectives.


Columbia Research Bureau Spanish Test* 


This test is designed for high schools and colleges. Part I calls
for basic knowledge of the language, and Parts II and III require
understanding of language structure and application of rules of
grammar. In more detail, Part I is a vocabulary test of one hundred
words in multiple-choice form. The student is instructed
to mark that one of four or five English words which best defines
the given Spanish word. Part II is a language comprehension
test. There are seventy-five sentences in Spanish arranged in
order of difficulty; each is to be read and marked “True” or
“False.” Part III is concerned with grammar and syntax. This
test consists of one hundred English sentences, each followed
by an incomplete translation in Spanish, which the examinee is
told to complete.


* Published by the World Book Company, Yonkers, N. Y.


California Arithmetic Test (Upper Primary, 
Grades 3 and 4)* 


This test is part of a comprehensive educational achievement
battery, but it may be given as a separate examination. Its objec-
tive is to test for skills in fundamental operations, the identifica-
tion of consistently made errors, and the ability to apply what is
known to the solution of problems. The eight sub-tests cover the
four fundamental processes (addition, subtraction, multiplica-
tion, and division), facility and skill in following directions in-
volving numbers, and simple "mental arithmetic" problems.


The Nelson-Denny Reading Test** 


The authors state the objectives of this test as follows: to pre- 
dict success in college, to enable a sectioning of college and 
high-school classes on the basis of reading skills, to aid in the 
diagnosis of scholastic difficulties. The examination consists of
two parts, a test of vocabulary and a test of the ability to read 
and understand fairly difficult prose. There are one hundred 
words in the vocabulary test, each word followed by five choices, 
one of which is to be marked as correct. The paragraph-reading 
test is made up of nine selections of approximately two hundred 
words each. Four questions are asked on each paragraph. There 
are five optional answers for each question, one of which is to
be selected by the examinee. It seems clear that the test measures 
basic knowledge of language as well as the ability to use this 
knowledge intelligently. 


SELECTING ITEMS FOR THE TEST 
In the construction of an examination, both the content and 
the form of the question must be considered. 


* Published by the California Test Bureau, Los Angeles, Calif.
** Published by the Houghton Mifflin Co., Boston, Mass.


Deciding on the Type of Item

The teacher must first decide what type of objective item
he wishes to use. True-False and multiple choice are favorites
for measuring basic knowledge, and multiple choice, matching,
completion and essay recall are all used to assess understanding,
interpretation and application. It is probably less confusing for
the younger students if a sub-test or section contains only items
of one type and does not switch from one kind to another. The
test-maker should start with a much larger number of items
than he plans to have in the completed test. All the questions
should be read by other teachers of the subject and criticized for
form and for content. Items judged to be trivial, inappropriate,
ambiguous, or too narrow in scope should be revised or discarded.
The items which survive this preliminary inspection
should still number considerably more than the number of items
planned for final use. An excess of items is necessary, since some
items will always be discarded as a result of the item-analysis to
follow.


Arranging the Items in Order of Difficulty

The questions are now arranged in a rough order of difficulty, 
from easy to hard. For the first try-out, the difficulty of an 
item as judged by several teachers is sufficient for placement. The
test, as tentatively drawn up, is now administered to a sample of
students for whom the final test is intended (for example,
fifth-grade pupils or high-school freshmen). If several teachers
of the subject co-operate, and thus increase the size of the
experimental group, the final test will be a better examination
than it will be if it is administered to a single small class. It is
always advisable to get as much information as possible on each
item. Hence, those examinees who take the examination in pre-
liminary form should be urged to attempt every item, even when
they are uncertain of the answer. The time allowance for the
whole test should be generous, so that every student will have
time to try every item. This may make it necessary to have a
second testing period.


Setting the Time Limit


The length of the time interval set for the test when put into 
final form will depend on the time available for testing—most 
often one period of about fifty minutes. Time allowances must 
always take into consideration the age of the pupils, type of 
item (amount of computation or reading needed in answering it), 
whether the test is primarily for survey purposes or for diag- 
nosis, and whether speed and/or power are deemed important. 
In examinations which are strictly power tests, the time limits
should be long enough for all but the very slowest examinees to 
finish. Sometimes, naturally, an examination has to be cut in 
length in order to have it fit into the available time. 


ITEM ANALYSIS 


The two characteristics of an item which we need to know 
about in building a test are (a) difficulty and (b) validity, or 
discriminative power. These two determinants of an item's good- 
ness are computed from the same tabulation of the test data. Com- 
putation of the difficulty and validity of an item is called item 
analysis. 


Difficulty and Validity in Item Analysis 


The difficulty of an item depends on how many of the examinees
in the tryout group answer it correctly. An item answered
correctly by 90 per cent of the group is obviously easier than one
answered correctly by 50 per cent or by 10 per cent, the last
being a hard item. Very hard and very easy items are ordinarily
less useful than items of intermediate difficulty (page 216). The
validity or discriminative power of an item depends on how well
it distinguishes between the brightest and dullest pupils in the
group. If all of the members of the experimental group answer
an item correctly, or if none does, the item has no validity,
since in neither case does it separate the good from the poor
members of the class.
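The two ideas in this paragraph can be put into a short Python sketch. It assumes that each examinee's response to an item has been recorded as 1 (right) or 0 (wrong); the function names are hypothetical.

```python
def proportion_passing(responses):
    """Share of the tryout group answering one item correctly.

    responses holds 1 for a right answer and 0 for a wrong one.
    """
    return sum(responses) / len(responses)


def cannot_discriminate(responses):
    """True when everyone passes the item or everyone fails it."""
    passed = sum(responses)
    return passed == 0 or passed == len(responses)


easy_item = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]   # 9 of 10 examinees pass
hard_item = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # 1 of 10 examinees passes

print(proportion_passing(easy_item), proportion_passing(hard_item))
print(cannot_discriminate([1, 1, 1, 1]))     # passed by all: no validity
```

An item flagged by `cannot_discriminate` tells nothing about differences among the pupils, whatever its content.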
Biserial r in Item Validity*


The authors of most standard mental tests have used the biserial
r method (or some approximation to it) in determining the
validity of the items in their tests. By means of biserial r, we can
compute the correlation between success and failure on a single
item and size of total score on the test, or on some other measure
of performance taken as the criterion. The size of the correlation
between item and test score shows how well the item is working
together with other items, that is, whether it is a member of a
team. Items unrelated to total score are discarded.
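For readers who wish to see the correlation computed directly rather than read from a table, the following Python sketch uses the conventional textbook formula for biserial r: the difference between the mean total scores of those passing and those failing the item, divided by the standard deviation of all total scores, and multiplied by pq/y, where p and q are the proportions passing and failing and y is the height of the normal curve at the point dividing p from q. The formula is not given in this chapter and is assumed here; the function name and the sample data are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def biserial_r(item, totals):
    """Biserial correlation between one item (1/0) and total test score.

    Assumes the conventional formula
        r_bis = (M_p - M_q) / sigma * (p * q / y)
    and requires that the total scores are not all equal.
    """
    passed = [t for i, t in zip(item, totals) if i == 1]
    failed = [t for i, t in zip(item, totals) if i == 0]
    if not passed or not failed:
        raise ValueError("item passed by all or by none: r is undefined")
    p = len(passed) / len(item)
    q = 1 - p
    unit_normal = NormalDist()                    # mean 0, standard deviation 1
    y = unit_normal.pdf(unit_normal.inv_cdf(p))   # ordinate at the p/q division
    return (mean(passed) - mean(failed)) / pstdev(totals) * (p * q / y)


# Hypothetical tryout data: six pupils, one item, and their total scores.
item = [1, 1, 0, 1, 0, 0]
totals = [9, 7, 8, 6, 4, 2]
print(round(biserial_r(item, totals), 2))
```

With so few cases the value means little; the sketch is meant only to show where each quantity in the formula comes from. The table method described in the steps that follow reaches an estimate of the same correlation from the high and low groups alone.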

Steps in the determination of item validity by use of biserial r 
are as follows: 


1. Arrange the test papers in order for total score from highest 
to lowest. 


2. Count off the highest and lowest 27 per cent** of the papers
(if not exactly, as nearly so as possible). If there are 120 children
in the “standardizing group,” for example, put 32 in the top and
32 in the bottom groups.
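Steps 1 and 2 can be sketched in Python as follows. The function name `extreme_groups` and the pupil labels are hypothetical; rounding 27 per cent to the nearest whole paper is an assumption consistent with the example of 120 children.

```python
def extreme_groups(scores, fraction=0.27):
    """Return (high, low) lists of pupils from a dict of pupil -> total score."""
    # Step 1: arrange the papers from highest to lowest total score.
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Step 2: count off the same number of papers at each end.
    k = max(1, round(len(ranked) * fraction))
    return ranked[:k], ranked[len(ranked) - k:]


# Hypothetical standardizing group of 120 pupils with scores 0 to 119.
scores = {f"pupil{n}": n for n in range(120)}
high, low = extreme_groups(scores)
print(len(high), len(low))
```

With 120 papers, 27 per cent is 32.4, so 32 papers fall in each group, as in the text. Tied scores at the cutting points are left in their original order here; a teacher may prefer to settle ties by lot.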


3. Count off the number in the high group and the number in
the low group who pass each item, and express these figures as
percentages. Suppose, for example, that Item 18 is passed by 60
per cent of the high group and by 30 per cent of the low group.
Then from tables prepared for the purpose,† we read that the
biserial correlation between this item and the whole test is .31.
For an item passed by 24 per cent of the high group and by only
3 per cent of the low group, the biserial r is .44. In general, any
item with a biserial r of .20 or more can be taken to be valid if
the test is fairly long. In a short test, items of higher validity are
needed. Both hard items and easy items are valid (that is, have
discriminative power) if they separate the high and low groups.
An item passed by 15 per cent of the high group and only 1 per
cent of the low group (a very hard item), for example, has a
biserial r of .47, whereas an item passed by 92 per cent of the
high group and 65 per cent of the low group (an easy item) has
a biserial r of .39. Both are good items, though they differ greatly
in difficulty.


* For the computation of biserial r, see references at the end of Chapter 2.

** There are good reasons for choosing 27 per cent. When the distribution of
ability is normal, the sharpest discrimination between extreme groups is obtained
when item analysis is based upon the highest and lowest 27 per cent in each
case. When larger per cents are in the high and low groups, the reliability of
the determination is higher, but the difference between the two groups de-
creases. On the other hand, when per cents in the high and low groups are
smaller, reliability falls off but the difference between the two groups increases.

† See, for example, Item Analysis Table by Chung-Teh Fan, published by the
Educational Testing Service, Princeton, N. J., 1952.

4. Determine the difficulty of each item by averaging the per-
centages that pass it in the high and low groups. An item passed
by 60 per cent of the high group and by 30 per cent of the
low group, for example, has a difficulty index of .45, that is,
(.60 + .30)/2; and an item passed by 15 per cent of the high and
1 per cent of the low groups has a difficulty index of .08. This
summary method of obtaining difficulties of items is not as
accurate as is the practice of using the whole group, but it saves
time and is precise enough for most tests.
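The averaging in step 4 is simple enough to check by machine. In this Python sketch the function name `difficulty_index` is hypothetical, and the proportions are the two pairs used in the text.

```python
def difficulty_index(p_high, p_low):
    """Average of the proportions passing in the high and low groups."""
    return (p_high + p_low) / 2


# The two items discussed in step 4.
for p_high, p_low in [(0.60, 0.30), (0.15, 0.01)]:
    print(f"{difficulty_index(p_high, p_low):.2f}")
```

The printed values agree with the indices of .45 and .08 given in the text. Because only the two extreme groups enter the average, the middle 46 per cent of the papers have no influence on the result, which is why the method is called a summary one.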


5. It can be shown mathematically that items with difficulty 
indices of .50 or thereabouts are the best items, in the sense of 
being able to differentiate among the largest number of good and 
poor students. Not many items, of course, will be found with 
difficulty indices of exactly .50; the range of difficulties usually 
runs from above .90 to below .10. If the test is to cover a wide
range of talent (and that is what is wanted in most school
examinations), a good plan to follow in selecting items is as follows:


Of items passed by 85-100 per cent (very easy)          take about 15 per cent
Of items passed by 50- 85 per cent (fairly easy)        take about 35 per cent
Of items passed by 15- 50 per cent (fairly hard)        take about 35 per cent
Of items passed by  0- 15 per cent (hard to very hard)  take about 15 per cent


All of the items should, of course, have satisfactory discriminative
power. Note that different proportions at the difficulty levels
follow the normal distribution.

Items passed by 100 per cent or by nobody have no validity 
in either case, but sometimes an author will place several very
easy items at the beginning of the test for psychological effect, 
and a few very hard items at the end to test the very bright pupils. 


6. In using multiple-choice items, it is important for the im- 
provement of the examination to know to what extent good and 
poor students have chosen the various distractors. If the wrong 
answers are illogical, obviously absurd, or otherwise not very 
misleading, the examinee will have little difficulty selecting the 
right option. The item is easier than it would have been had
the misleads been more attractive. Information concerning the
efficacy of misleads can be obtained by tallying the responses of 
the high and low groups to each mislead, as shown below. The 
group considered is the 120 children referred to in the illustra- 
tion above, and there are 32 (27 per cent) in the high and 32 in 
the low groups. The item is of the multiple-choice type with 
four options, and the correct answer is keyed as (b). 


Item 26        a    (b)    c     d    Omissions   Total
High group     1    16     8     7        0         32
Low group      3     7    10    12        0         32


It is clear that distractor (a) needs to be rewritten, since only 
four in sixty-four chose it. Otherwise, item 26 differentiates be- 
tween the good and poor students rather well. 

A second example shows a slightly different situation. Here (c)
is keyed as the correct answer. 


Item 10        a     b    (c)    d    Omissions   Total
High group     0    15    11     5        1         32
Low group      5    10     9     8        0         32


Mislead (b) is chosen by more of the good students than is the
correct answer (c); and this is true, too, of the poor students.
Obviously, mislead (b) must be made less attractive or otherwise
changed so that it doesn't compete so strongly with (c). Further- 
more, (c) might be strengthened and (a) examined further to 
see why it failed to attract any answers in the high group. 


FIGURE 9-1 Item Analysis Data for Test File 


FRONT OF CARD 


Item 36: What marked change took place in the political status of 
India in the year 1947? 


1. She received a mandate from the United Nations. 


2. She won her independence from Britain. 
3. Her people were united under Mohammedan rule. 
4. She joined the Arab League. 


BACK OF CARD 


Item 36: 1 2 3 Omissions 
High group: 10 32 6 0 
Low group: 19 n 13 2 


Sample: 200 high school seniors, tested in June, 1953 
Validity: biserial r = .41 
Difficulty: 39 per cent


7. Many teachers find it profitable to keep a file of items for 
future use. A good plan is to write the item on one side of a card. 
On the back of the card should be listed (1) the size and char- 
acter of the experimental group on which the data are based, (2) 
the validity of the item (its biserial r with the test score), (3) the 
difficulty of the item, and (4) data on misleads. Figure 9-1 shows
these data on an item taken from a test in contemporary history. 

When a teacher has accumulated a large file of items, tests of
approximately the same range and validity may be made up as
needed.
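
The tallies of misleads described in step 6 can also be checked mechanically. The following is a minimal sketch in Python, not a procedure given in the text: the function name `flag_options` and the 10 per cent cut-off for a "rarely chosen" mislead are assumptions made for illustration. The counts are those of Items 26 and 10 above.

```python
# Sketch only: flag_options and min_share are illustrative assumptions.
def flag_options(high, low, key, min_share=0.10):
    """Return notes on misleads that look too weak or too attractive.

    high, low : dicts mapping option label -> number choosing it
    key       : label of the keyed (correct) option
    min_share : assumed cut-off below which a mislead counts as rarely chosen
    """
    total = sum(high.values()) + sum(low.values())
    notes = []
    for option in high:
        if option == key or option == "omit":
            continue
        chosen = high[option] + low[option]
        if chosen / total < min_share:
            notes.append(f"({option}) attracts few examinees")
        if high[option] >= high[key]:
            notes.append(f"({option}) competes with the key in the high group")
    return notes


item26_high = {"a": 1, "b": 16, "c": 8, "d": 7, "omit": 0}
item26_low = {"a": 3, "b": 7, "c": 10, "d": 12, "omit": 0}
item10_high = {"a": 0, "b": 15, "c": 11, "d": 5, "omit": 1}
item10_low = {"a": 5, "b": 10, "c": 9, "d": 8, "omit": 0}

print(flag_options(item26_high, item26_low, key="b"))
print(flag_options(item10_high, item10_low, key="c"))
```

With these counts the sketch singles out mislead (a) of Item 26 and misleads (a) and (b) of Item 10, which agrees with the reading of the two tables given above. A different cut-off would flag a different set, so the teacher's judgment still decides which options to rewrite.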


A SHORT METHOD OF ITEM ANALYSIS 


It is wise for a teacher to understand what the biserial
coefficient of correlation means and what it does, since the device
is utilized in many standard tests and is frequently mentioned in
the literature of testing. At the same time, it is not necessary for
the teacher to employ the method in order to construct good
classroom examinations. The difference between simple counts
of “rights” in selected fractions of the best and poorest pupils
will suffice as a measure of the validity or discriminative power
of an item. First, the items should be gone over by several
teachers, the unsatisfactory items discarded, and the remaining
items arranged in order of difficulty, as judged by the
teachers reviewing the items. Next, the test as tentatively
drawn up is administered to a sample of children drawn
from the classes or age levels to be tested. From here on the
steps are as follows:


1. Arrange the test papers in order for size of total score, from
the highest to the lowest.


2. Count off the 25 per cent* of the best papers and the 25 per 
cent of the poorest papers. If the total group is small (for ex- 
ample, under fifty) take some larger proportion, say the upper 
half and the lower half. Suppose there are eighty pupils in the 
experimental sample (try-out group), so that twenty, or 25 per 
cent, fall in the high group and twenty in the low group. Each 
item may now be examined to see whether it is able to separate
these two criterion groups.

3. Determine the number in each of the two criterion groups
who answer each item correctly. If fifteen in the high group


* Unless the biserial r method is used in determining validities, there is no
need to observe the somewhat unwieldy 27 per cent rule; 25 per cent or any
convenient larger percentage will serve.

answer an item correctly, and five in the low group get the item
right, the validity is 15 − 5, or 10, and the validity index is 10/20
or .50.* If all twenty in the high group answer an item
correctly, and none of the low group gets it right, the validity of
the item is maximal: 20 − 0 = 20, and the validity index is 20/20
or 1.00. The lowest validity index of an item by this method is,
of course, 0/20 or .00. Validity indices run, therefore, from 0 to
1. There may be a few items of negative validity (more rights in
the low than in the high group), but such items are rare. Items
having zero or negative validity must be rewritten before they
are used, or discarded if salvage is impossible.


4. If $R_H$ = number right in the high group and $R_L$ = number
right in the low group, the discriminative power of an item is
simply $(R_H - R_L)$, or $(R_H - R_L)/N_H$ when written as a validity
index. Using the same nomenclature, we may write the difficulty
index of an item as $(R_H + R_L)/(N_H + N_L)$, in which $N_H$ and $N_L$
are the numbers in the high and low groups, respectively. In our
example above wherein $R_H = 15$ and $R_L = 5$, the validity
index is 10/20, or .50, and the difficulty index is
(15 + 5)/(20 + 20), or .50. If $R_H = 18$ and $R_L = 12$, the validity
index is 6/20, or .30, and the difficulty index is 30/40, or .75.
Again, if $R_H = 10$ and $R_L = 2$, the validity index is 8/20, or
.40, and the difficulty index is 12/40, or .30.


5. Select the items having the highest validity indices for the 
final test. Then follow the table on page 216 in apportioning 
items to the various levels of difficulty, if the test is to cover a 
fairly wide range of talent. 


6. It is advisable to examine the misleads when multiple-choice 
items are to be used. The method outlined on page 217 will aid 
in locating distractors which are too plausible or not plausible 
enough. The first kind are too often accepted, and the second 
are taken by only a few.

* Validities can be left simply as the difference between the number right
in the two extreme groups. The chief advantage of a validity index is to put
validities in a percentage scale, as are the difficulties.

7. A card file of acceptable items will prove useful when a
teacher wants to lengthen a test or to replace non-functioning 
items. When there are a number of items, a parallel form of the 
test can be drawn up. 
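
The apportioning of items mentioned in step 5 can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the names `BANDS`, `band_of`, and `quotas` are hypothetical, the shares are those of the selection plan given earlier in the chapter (15, 35, 35, and 15 per cent), and an item falling exactly on a boundary (.85, .50, or .15) is assigned to the easier band, a choice the text does not make.

```python
# Sketch only: names and the boundary rule are illustrative assumptions.
# Each band: (label, lowest proportion passing, highest proportion passing,
#             share of the final test drawn from the band)
BANDS = [
    ("very easy", 0.85, 1.00, 0.15),
    ("fairly easy", 0.50, 0.85, 0.35),
    ("fairly hard", 0.15, 0.50, 0.35),
    ("hard to very hard", 0.00, 0.15, 0.15),
]


def band_of(difficulty_index):
    """Return the band label for a difficulty index (proportion passing)."""
    for label, low, high, _share in BANDS:
        if low <= difficulty_index <= high:
            return label  # first match wins, so boundaries go to the easier band
    raise ValueError("difficulty index must lie between 0 and 1")


def quotas(test_length):
    """Approximate number of items to draw from each band."""
    return {label: round(test_length * share) for label, _, _, share in BANDS}


print(band_of(0.70), "|", band_of(0.30))
print(quotas(40))
```

For tests whose length does not divide evenly, the rounded quotas may not sum exactly to the intended length, and the teacher would adjust one band by an item or two.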

Table 9-1 shows the sort of data which we can expect to get in 
an item analysis of questions administered to a sample of 80, as 
described above. The full table, of course, would contain data
on all the items and on all the members of the two criterion
groups. Half of the scores in the high and in the low groups are
not shown in order to shorten the table, but these omitted scores
are included in the totals upon which the item analysis is based. 
Each of the two criterion groups (the high and the low) consists 
of twenty examinees. 

Examination of Table 9-1 shows that Item 4 is highly valid and 
that Items 1, 2, and 5 are acceptable. Item 3 has no validity and 
must be dropped or changed drastically. An item with a validity 
index of .20 or more may be considered satisfactory, at least
tentatively. This figure is arbitrary, however. If the test is
shortened, the acceptable point for a validity index should be
raised; if the test is lengthened, it should be lowered. Any item
with an index larger than 0 has some validity and hence some
value. Note that in Table 9-1 the difficulty indices range from
.70 (a fairly easy item) to .30 (a fairly hard item).
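
The two indices of the short method can be computed directly from the counts of rights. The sketch below is illustrative Python, not part of the original procedure; the function names are hypothetical, and the counts are the $R_H$ and $R_L$ rows of Table 9-1 with $N_H = N_L = 20$.

```python
# Sketch only: function names are illustrative assumptions.
def validity_index(r_high, r_low, n_high):
    """(R_H - R_L) / N_H: discriminative power on a 0-to-1 scale."""
    return (r_high - r_low) / n_high


def difficulty_index(r_high, r_low, n_high, n_low):
    """(R_H + R_L) / (N_H + N_L): proportion of both groups passing the item."""
    return (r_high + r_low) / (n_high + n_low)


R_HIGH = [15, 16, 10, 20, 8]   # rights in the high group, Items 1-5
R_LOW = [8, 10, 10, 8, 4]      # rights in the low group, Items 1-5
N_HIGH = N_LOW = 20
CUTOFF = 0.20                  # the tentative standard suggested in the text

acceptable = []
for item, (rh, rl) in enumerate(zip(R_HIGH, R_LOW), start=1):
    v = validity_index(rh, rl, N_HIGH)
    d = difficulty_index(rh, rl, N_HIGH, N_LOW)
    if v >= CUTOFF:
        acceptable.append(item)
    print(f"Item {item}: validity index {v:.2f}, difficulty index {d:.3f}")

print("Items meeting the .20 standard:", acceptable)
```

Item 3 is the only one excluded, as in the discussion of the table. The cut-off of .20 is the arbitrary figure named in the text and would be raised or lowered with the length of the test.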


SCORING THE COMPLETED TEST 


If the completed test is cast in T-F form, the point scores will
be simply number right, or R − W if we wish to correct for
guessing (page 191). In multiple-choice tests, the correction for
guessing is

$$\text{Score} = R - \frac{W}{n - 1}$$

where n = the number of choices or options. It is sometimes
advisable to use the correction for guessing with T-F items, but
number right without correction is satisfactory in multiple-choice
when four or five options are provided.
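
The correction formula is easily put into a short Python function. This is a sketch with a hypothetical name, `corrected_score`; it assumes, as is usual with this formula, that omitted items are counted neither as right nor as wrong.

```python
# Sketch only: corrected_score is an illustrative name.
def corrected_score(right, wrong, options):
    """Score = R - W / (n - 1), where n is the number of options per item.

    With n = 2 (true-false items) this reduces to R - W.
    Omitted items are assumed to be left out of both counts.
    """
    if options < 2:
        raise ValueError("an item needs at least two options")
    return right - wrong / (options - 1)


print(corrected_score(right=30, wrong=10, options=2))  # true-false test
print(corrected_score(right=40, wrong=12, options=4))  # four-option test
```

The more options an item carries, the smaller the deduction for each wrong answer, which is one way of seeing why the uncorrected number right is usually adequate with four or five options.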


TABLE 9-1

Item Analysis of the First Five Items of a Test Made Upon Two
Criterion Groups, the Highest and the Lowest 25 Per Cent
in Total Score. N_H = N_L = 20, and N = 80

Highest Group            Total test score           ITEMS
(Best 25 per cent)       in order of size
In order of merit                            1     2     3     4     5
         1                     72            ✓     ✓     ✓     ✓     ✓
         2                     70            0     ✓     0     ✓     0
         3                     68            ✓     ✓     ✓     ✓     ✓
         4                     65            ✓     0     0     ✓     0
         5                     65            ✓     ✓     ✓     ✓     ✓
         6                     65            ✓     ✓     ✓     ✓     ✓
         7                     63            ✓     ✓     0     ✓     0
         8                     61            ✓     ✓     ✓     ✓     ✓
         9                     60            ✓     ✓     0     ✓     0
        10                     60            ✓     ✓     ✓     ✓     ✓
        20                     54            0     ✓     0     ✓
                            R_H =           15    16    10    20     8

Lowest Group
(Poorest 25 per cent)
         1                     35            ✓     ✓     ✓     ✓     0
         2                     34            0     ✓     ✓     0     0
         3                     30            ✓     ✓     ✓     ✓     0
         4                     30            ✓     0     0     0     0
         5                     27            0     0     ✓     0     0
         6                     25            ✓     0     ✓     ✓     ✓
         7                     25            0     0     0     ✓     ✓
         8                     24            ✓     0     ✓     ✓     0
         9                     23            ✓     ✓     0     0     ✓
        10                     23            0     0     0     ✓     ✓
        20                     12            0     ✓     ✓     0     0
                            R_L =            8    10    10     8     4
                      R_H − R_L =            7     6     0    12     4
              (R_H − R_L) / N_H =          .35   .30   .00   .60   .20
      (R_H + R_L) / (N_H + N_L) =          .58   .65   .50   .70   .30
In most cases, it is sufficient for the teacher to express standing
on the test in point scores or totals. If several classes have been
tested and it is desirable to compare their performance, percentile
ranks will be useful. Scaling of teacher-made tests in standard
scores or normalized scores is not recommended unless the test
is to be used throughout a school system.

Directions for the final test should be explicit, and time limits
should be given. Manuals for standardized tests may be consulted
with profit for pointers on directions and time limits. A test
should not be so long that most students cannot finish in the time
allowed.

The use of scoring stencils will speed up marking when many
papers are to be examined. In T-F tests, a strip containing the
answers (a key) may be laid alongside the left-hand margin and
the answers checked as right or wrong, or simply the right
answers checked. Separate scoring sheets are useful in dealing
with multiple-choice and matching items. Spaces are numbered 
on the answer sheet for recording answers to the questions on the 
test. The test blank itself is not marked and may be used more 
than once. 


THE RELIABILITY OF THE COMPLETED TEST 


Perhaps the easiest method of estimating the reliability of a
teacher-made test, since there is rarely more than one form, is
by what is called the “split-half” technique. In this procedure,
the test is administered only once to a sample of examinees, and
is then divided into two half-tests. The first half-test contains
the odd-numbered items (1, 3, 5, and so on) and the second
half-test the even-numbered items (2, 4, 6, and so on).* The
correlation between scores on the two half-tests is now found and
from this r the correlation of the whole test with itself (its
self-correlation) is predicted by the well-known “prophecy


* Note that when a test is split into odd and even items, the range of
difficulty in the two half-tests is the same and the split is unique. Not just any split
into two half-tests is satisfactory.

formula.† To illustrate, suppose that in a class of ten seventh
graders, an English Literature test in multiple-choice form has
an odd-even correlation of .50. What is the probable
self-correlation of the whole test? The prophecy formula is

$$r_{\text{whole test}} = \frac{2 \times r_{\text{half-test}}}{1 + r_{\text{half-test}}}$$

Substituting r = .50 for the self-correlation of the half-test, we
have that

$$r_{\text{whole test}} = \frac{2 \times .50}{1 + .50} = .67$$


This is a satisfactory reliability coefficient (.67) for a single class.
For standardized tests administered to very large groups of a wide
range of talent, reliability coefficients will ordinarily be higher:
.90 or more. For teacher-made tests, however, the reliability
coefficients will rarely be more than .60 to .70. Reliability is higher
over several grades, that is, when the test is given to more than
one grade. The standard error of a test score can be computed by
the formula given on page 29, but for the teacher-made test
this is often a needless refinement.

Reliability coefficients for a teacher-made test should always 
be computed from a new class, never from the sample used in 
determining the validities of the test items. Self-correlation in 
the standardization group will always be spuriously high, because 
the selection of items was based on the scores of the high and 
low members of the sample. 
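
The split-half procedure described in this section can be sketched as follows. The Python below is an illustration, not the author's worked method: the function names are hypothetical, and the small matrix of responses (1 = right, 0 = wrong) is invented purely to show the steps.

```python
# Sketch only: function names and the response data are illustrative.
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Prophecy formula: self-correlation of the whole test from r of the halves."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(responses):
    """Odd-even split: items 1, 3, 5, ... against items 2, 4, 6, ..."""
    odd_scores = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_scores = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_scores, even_scores))


# Hypothetical responses of five pupils to a six-item test.
RESPONSES = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
]

print(round(spearman_brown(0.50), 2))          # the example in the text
print(split_half_reliability(RESPONSES))
```

A sample as small as the invented one here would give a very unstable coefficient; it serves only to show where the odd and even half-scores enter the computation.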




VALIDITY OF THE COMPLETED TEST 


A teacher-made test in physics or French, for example, will 
always have content validity, even when the sampling is quite 
narrow. Teacher-made tests rarely cover as much material as do 
the standard printed tests. An approximate measure of validity 
for a test can be found by correlating test scores against school


† The Spearman-Brown prophecy formula is treated in all standard texts
dealing with statistical method in psychology and education.

grades in the same subject. This method is not entirely
satisfactory, since school marks are rarely more dependable measures
of the subject matter than are the tests. When experimental
validity is attempted by correlating scores on a teacher-made test
with grades or with other test scores, a new group must always
be utilized. Such validation, called cross validation, is necessary
because the group used in item analysis is a special group which
has served to select the items in the first place. Cross validation is
necessary also when the two criterion groups (the upper and
lower extreme groups) are selected on the basis of school grades.
A teacher-made test will of necessity correlate with grades
achieved by this group, since the group selected the items.

Perhaps the best way to judge the value of a teacher-made test
is by its predictive validity. If the test aids the teacher in getting
a better notion of the individual differences within the class, and
leads to better understanding of the difficulties of the students
(meager knowledge, wrong knowledge, and so on), it has
fulfilled its purpose.

SUGGESTIONS FOR FURTHER READING 


Bean, K. L. Construction of Educational and Personnel Tests. New
York: McGraw-Hill, 1953.
Noll, V. H. Introduction to Educational Measurement. Boston:
Houghton Mifflin, 1957.
Ross, C. C., and Stanley, J. C. Measurement in Today's Schools (3rd
edition). New York: Prentice-Hall, 1954.
Travers, R. M. How to Make Achievement Tests. New York: Odyssey
Press, 1950.


SUGGESTIONS FOR LABORATORY WORK 


1. Assume that you have tried out 50 T-F items on a class of 40 pupils.
Draw up a table like that of Table 9-1 showing how you would carry
out an item analysis.

2. If time allows, construct a test using your class as standardizing
sample. Multiple-choice items in arithmetic and vocabulary taken from
E. L. Thorndike's The Measurement of Intelligence may be used
conveniently: Thorndike's book gives items by levels over a wide range of
difficulty. Administer a test of about fifty items and item-analyze the
results by the method given on pages 219-223.

3. Take a test which has been given to this class or to some other class.
Analyze the questions for validity, following the method on page 222.


QUESTIONS FOR DISCUSSION 


1. A sixth-grade teacher has administered a test in fundamentals of
arithmetic. What analyses of the test data could this teacher make which
would (a) help his future teaching, (b) be of value to individual pupils?

2. Under what conditions would it be profitable to correct scores on 
a multiple-choice test for guessing? 

3. In some schools, one teacher makes all the examinations in a given
subject. What are the advantages and disadvantages of this procedure? 

4. What might an item of negative validity mean? Of zero validity? 




CHAPTER 10


SOME PROBLEMS IN THE EVALUATION 
OF TEST SCORES 


Interpreting Multiple Aptitude Test Scores 


Table 10-1 gives the scores achieved by ten ninth graders on 
the Differential Aptitude Tests (DAT). Scores on any mental 
test are more meaningful when supplemented by the pupils’ 
school grades and by a knowledge of personality traits, interests, 
and ambitions. With this proviso in mind, it will be interesting 
to answer the questions below with reference only to the
percentile ranks in Table 10-1.

QUESTIONS ON TABLE 10-1

1. Which two students show the poorest scholastic ability? In
what jobs might they do best?

2. Which student exhibits the most consistently high level of
ability?


3. Which students are likely to have reading difficulties? 

4. Which girl should do well in secretarial work?

5. If Joe Kramer wants to go to college, would you encourage 
him to plan to go into engineering? 

6. Would you encourage Larry Edwards to go into his father's
accountancy firm after graduation from high school?

7. Jane Goodrich plans to become a medical technician. Would 
you recommend this vocational goal? 

8. Which students will probably find it hard to graduate from
high school?

9. Frank Seay's father is an auto mechanic and Frank is
interested in this work. Do you think it a wise vocational choice?

10. Is it likely that several students are handicapped by poor
spelling and language usage? Why?


Case Studies in Evaluation of Abilities 

The three case studies which follow provide considerable data 
about three pupils, two in high school and one in elementary 
school. Questions are planned to focus upon things to look for in 
evaluating the promise of the pupils being considered. 


I. Case Study of Robert T. 

Robert is 16-2, a sophomore in high school. He is well-grown 
and makes a good appearance. He is well behaved, quiet, inter- 
ested, though not as a participant, in sports, and does not read 
much. Robert’s father is a house painter; both parents are high- 
school graduates. Robert wants to go to college, and is encour- 
aged to do so by his parents. He wants to be an engineer. 


School Data

Ninth Grade                        Tenth Grade (First term)
English              C             English              C
Social Studies       B             Social Studies       D
Mathematics          B             Physics              B
General Science      B             French               D
Physical Education   C             Physical Education   C

Test Data

Otis Quick Scoring (Form Gamma)                      IQ 112
California Mental Maturity (Language)                IQ 110
California Mental Maturity (Non-language)            IQ 121
Cooperative General Achievement Test:                Percentile Ranks
    I. Social Studies                                38
    II. Natural Science                              52
    III. Mathematics                                 36
Kuder Preference Record (Vocational)                 Percentile Ranks
    Mechanical                                       63
    Computational                                    51
    Persuasive                                       15
    Artistic                                         12
    Literary                                         46
    Musical                                          51
    Social Service                                   20
    Clerical                                         50
    Scientific                                       72


1. What do you think of Robert's chances of succeeding in
college?

2. Robert's interests are in the mathematics-science area; are
they strong enough for him to plan engineering as a vocation?

3. Robert's school record is weak in English and Social Studies,
and his interests do not lie in persuasive and artistic fields. What
occupations would you encourage him not to enter?

4. Is the variation in Robert's IQ's too great to arise from
chance?

5. How do you interpret the difference between Robert's
language and non-language IQ's?

6. Would you say that Robert's school grades are not in
keeping with his IQ?

7. Do you think Robert might be more successful as a
technician than as an engineer?

8. Would you recommend that Robert become a salesman?

9. Do Robert's interests jibe with his achievement test records?
With his school marks?

10. Might Robert do well as an airplane pilot?


II. Case Study of William S. 

William is 18-1, a senior in high school. He makes a good 
appearance, is husky and muscular. William is easy-going and
affable; he likes to hunt and is interested in, and good at, sports. 
His father is a successful lawyer and his mother is a college 
graduate interested in club activities. The parents have planned 
for William to study medicine: his grandfather was a well- 
known physician in the community. William has accepted these 
vocational plans but says he is more interested in business and 
sales work. 


School Data

Tenth Grade                        Eleventh Grade
English              B             English              C
Social Studies       B             Social Studies       B
Mathematics          D             Mathematics          D
Physics              C             Spanish              B
Physical Education   B             Physical Education   B

Test Data

Terman-McNemar Test of Mental Ability                IQ 118
California Achievement Tests (Advanced)              Percentile Ranks
    Reading                                          65
    Mathematics                                      40
    Language                                         60
Differential Aptitude Test (Tenth Grade)             Percentile Ranks
    Verbal Reasoning                                 86
    Numerical Ability                                42
    Abstract Reasoning                               38
    Space Relations                                  40
    Mechanical Reasoning                             32
    Clerical Speed and Accuracy                      55
    Language Usage: Spelling                         75
    Language Usage: Sentences                        93
Kuder Preference Record (Vocational)                 Percentile Ranks
    Outdoor                                          96
    Computational                                    32
    Persuasive                                       90
    Artistic                                         40
    Literary                                         86
    Musical                                          54
    Social Service                                   36
    Clerical                                         40
    Scientific                                       26


1. Do you think that William is college material? 

2. Would you encourage him to plan for medicine as a career? 

3. Do William's grades verify his DAT scores? 

4. Is language a strong area for William? Would you, on the
strength of this, suggest some other vocation than medicine for
William? If so, what?

5. What are William's strong interests, as revealed by the 
Kuder Record? 

6. Are William's achievement test scores in line with his 
school grades? 

7. Do you think William might be happier and more success- 
ful in business? Or in the study of law? Give reasons for your 
answers. 

8. William's IQ does not jibe with his DAT scores. Can you 
give any reasons why this should be so? 

9. The Kuder scores are more helpful than the DAT in 
counseling William. Would you agree with this judgment? 

10. How would you explain to William's father his
considerable variability in scores? And how would you explain the
apparent contradictions?


III. Case Study of Mary S.

Mary is 11-8, in the second half of the sixth grade. She is
pleasant and well mannered, but is judged by her teachers to be
"nervous" and overanxious. Mary wants to be a teacher. Her
father is an auto salesman, with high-school education; her
mother is a housewife, with junior-college training. There are
three other children in the family.


School Data

Fifth Grade                        Sixth Grade
Reading              C             Reading              B
Social Studies       C             Social Studies       C
Arithmetic           B             Arithmetic           C
Science              C             Science              C
Language             C             Language             B

Test Data

Kuhlmann-Anderson Intelligence Tests                 IQ 110
Metropolitan Achievement Tests                       Grade Equivalents
    Reading                                          6.1
    Vocabulary                                       6.8
    Arithmetic Reasoning                             5.4
    Arithmetic Comprehension                         5.2
    English                                          6.2
    Spelling                                         6.6
    History                                          5.6
    Science                                          4.6


1. In what subjects is Mary weakest?

2. Would you encourage her to plan for teaching as a career?

3. Is Mary college material? Give reasons for your opinion.

4. Could Mary do office and clerical work successfully?

5. Would it help to have a Stanford-Binet IQ for Mary? Give
reasons for your answer.



Sociometric Testing 


From observations in school and out, most teachers get a fairly
good idea of the social and personal relations within their
classrooms. They soon come to know which children are leaders,
which are well liked and popular, which are disliked or ignored,
and which are picked on and teased. It is sometimes valuable for
a teacher to have, in addition to his own opinion, some measure
of the attitudes and feelings of the pupils regarding each other.
When data of this sort are collected systematically, they may be
put into a table or expressed in the form of a sociogram. This
last is a pictorial or graphic representation of the interpersonal
relations within some specified group, often a class.

The usual procedure is to ask the pupils to designate the class- 
mate by whom they would rather sit, or the child (or children) 
with whom they would prefer to play ball at recess, or to make 
some other choice of a companion in a real life situation. Table 


10-2 shows the responses made by ten fifth-grade pupils when
asked to nominate their first and second choices of a child to
work with on a class project. (The table reproduces only part
of the data for a class.)

TABLE 10-2
Sociometric Tabulation

[Table body not recoverable. Rows list the pupils making the choices and
columns list the pupils CHOSEN: David, Anita, Sally, Gary, Karl, Janet,
Jack, Helen, Laura, Ruth. Two summary rows at the foot give the number of
1st Choices and 2nd Choices received by each pupil.]

A first choice is shown by a 1 under the name of the child 
chosen, a second choice by a 2. An asterisk (*) denotes that the 


choice for first place was mutual. Thus, David chose Gary and
was chosen by Gary, and Anita and Janet each named the other
as first choice. The summary at the bottom of the table shows
David to be the most chosen child, with three firsts and one
second. Janet is the next most popular, with three firsts. Sally is
chosen by no one, and three of the girls (Helen, Laura and Ruth)
and one boy (Jack) receive no first choices. Tabulation of the
responses as given by the children will provide the "choice"
information that the teacher wants.

FIGURE 10-1 Sociogram for 21 Kindergartners, 13 Boys and
8 Girls

[Sociogram not reproduced. Group: Kindergarten; Number: 21; Boys: 13.
The legend distinguishes strong (3) choices, reciprocals, and partial
reciprocals by different arrow styles.]

From Northway, Mary L., and Weld, Lindsay, Sociometric Testing. Reproduced
by permission of the University of Toronto Press.

A more striking method of representing the social relations
within the group is afforded by the pictorial sociogram shown in
Figure 10-1. The responses were those of twenty-one
kindergarten children: thirteen boys and eight girls. The stars
(popular, often chosen children) are quickly located, as are also the
isolates, whom no one chooses. The two-headed arrows indicate
mutual choices.

When used wisely, a sociometric test can be helpful to a
teacher, especially when the class is too large for close personal 
observation. Some of the things which a sociogram may reveal 
are the following: 


1. Good and bad personal relations, free interchange of 
choices, or the existence of cliques. 


2. Clusters and cleavages resulting from differences in race, 
religion, sex, and economic conditions of families. 


3. Differences between in-school and out-of-school social 
groupings. 

The sociometric method has some disadvantages and may do 
more harm than good if the morale of the class is low because of
poor discipline, frequent change of teachers, or other disrupting
influences. For example, choices may be trivial or deliberately 


hostility and resentment against other pupils or against the 
teacher. Moreover, the choices of young children are often
fleeting, vary from time to time, and are quite unreliable. The
sociogram, therefore, is not foolproof. At the same time, in the hands
gram, therefore, is not foolproof. At the same time, in the hands 
of a skillful teacher, sociometric testing will often provide new 
insights into the personality traits of pupils and thus aid in 
discipline and in remedial work. 
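
The tabulation of choices lends itself to a short program. The Python sketch below is illustrative only: the pupils are labeled A through E, the choices are invented for the example and are not the data of Table 10-2, and the function names are hypothetical.

```python
# Sketch only: pupils, choices, and function names are hypothetical.
from collections import Counter

# chooser -> (first choice, second choice)
CHOICES = {
    "A": ("B", "C"),
    "B": ("A", "C"),
    "C": ("A", "B"),
    "D": ("A", "C"),
    "E": ("C", "D"),
}


def tally(choices):
    """Count first and second choices received by each pupil."""
    firsts = Counter(first for first, _ in choices.values())
    seconds = Counter(second for _, second in choices.values())
    return firsts, seconds


def mutual_first_choices(choices):
    """Pairs of pupils who named each other as first choice."""
    pairs = set()
    for chooser, (first, _) in choices.items():
        if first in choices and choices[first][0] == chooser:
            pairs.add(tuple(sorted((chooser, first))))
    return sorted(pairs)


def isolates(choices):
    """Pupils chosen by no one, as first or second choice."""
    chosen = {name for pair in choices.values() for name in pair}
    return sorted(set(choices) - chosen)


firsts, seconds = tally(CHOICES)
print("First choices received:", dict(firsts))
print("Second choices received:", dict(seconds))
print("Mutual first choices:", mutual_first_choices(CHOICES))
print("Isolates:", isolates(CHOICES))
```

The summary rows of a sociometric table correspond to the two tallies, the asterisks to the mutual pairs, and the unchosen children to the isolates. Such counts carry the same cautions about unreliability as the sociogram itself.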
APPENDIX A


STATISTICAL SUPPLEMENT 


In order to understand and use test results wisely, a teacher 
should be familiar with those statistical concepts most often em- 
ployed in mental testing. One of the best ways to accomplish 
this is to work through the computation of the basic statistics. In 
Chapter 2 a number of statistical terms were defined and their 
application illustrated. In subsequent chapters these statistics have 
been frequently employed. If, when a statistic is first mentioned, 
the student will work through its derivation—for example, the 
tabulation of a frequency distribution or the computation of
an r—the value of the statistic to mental testing will be clarified.
A second or even a third review is often helpful. A good analogy
here is the habit of looking up unfamiliar words in a dictionary.
Sometimes a word must be looked up more than once before its
meaning is clearly grasped.
This Appendix deals with the following topics: 


The Frequency Distribution
The Frequency Polygon and the Histogram
Averages: Mean, Median, and Mode
Measures of Variability: Range, Q, and SD (σ)
The Coefficient of Correlation


Drawing up a Frequency Distribution 


Test scores are more readily dealt with when they have first
been organized into a frequency distribution. Suppose that Miss
Norton has administered a standard test to her class of forty
pupils in social studies, and that scores are as follows:


37, 38, 36, 31, 28, 33, 24, 19, 25, 34
16, 43, 22, 20, 26, 44, 27, 19, 25, 34 
33, 24, 22, 20, “4, 27, 31, 28, 38, 17 
31, 26, 34, 17, 19, 20, 22, 24, 26, 29

Table A-1 shows these forty scores tabulated into a frequency
distribution in which the interval is five score units. Steps in
setting up a frequency distribution follow:


TABLE A-1
Frequency Distribution of Forty Scores on a Social Studies Test.

Intervals    Midpoints    Tallies           f
40-44           42        |||               3
35-39           37        ||||              4
30-34           32        ||||| |||         8
25-29           27        ||||| |||||      10
20-24           22        ||||| ||||        9
15-19           17        ||||| |           6
                                      N =  40


(1) Determine the range, or the gap between the highest
and lowest scores. Examining our set of forty scores, we find the
range to be from 44 to 16, or 28.

(2) Select an interval which will be convenient for
tabulation. A good working rule is to take a grouping unit which will
yield from five to fifteen intervals. This rule may have to be
broken when the sample is very large (200 or 300, say) or very
small (less than 25).

(3) Divide the range by the interval size tentatively chosen.
This gives the approximate (within one) number of intervals.
In Table A-1 the range of 28 divided by five gives 5.6, and the
number of intervals is six. Five is a better choice than is a larger
or smaller unit. For example, an interval of three will spread the
data out too thin (into ten intervals), whereas an interval of ten
crowds all forty scores into three intervals.

(4) Write the beginning and end of each interval as a score:
for example, 15-19. Actually a score of 15 represents the interval
from 14.5 to 15.5 (that is, a distance along an ability scale), and
19 represents the interval 18.5 to 19.5. Hence, the lowest interval
begins at 14.5 and ends at 19.5; the second interval begins at 19.5
and ends at 24.5, and so on. Writing score limits instead of actual
limits saves time and avoids the confusion which often arises
when one interval ends and the next begins with the same figure.

(5) Tally each score under its proper interval as shown in
Table A-1. Then write the sum of the tallies opposite each
interval under f (frequency). Sum the f's to give N.

Note that the midpoint of the topmost interval is 42; that is,
2.5 from 39.5 and 2.5 from 44.5. The midpoints have been
entered in the second column. When scores have been arranged
into a frequency distribution, all of the f's within a given interval
are represented by the midpoint of that interval.
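
Steps (1) through (5) can be sketched in a few lines of Python. This is only an illustration and not part of the original procedure; the names SCORES and frequency_distribution are invented for the example. One score in the third row of the list is illegible in this copy and is assumed here to be 44. Any value from 40 to 44 gives the same tallies, since Table A-1 shows three scores on the top interval.

```python
# The forty social studies scores. The illegible score in row 3 is
# ASSUMED to be 44; any value from 40 to 44 yields the same table.
SCORES = [
    37, 38, 36, 31, 28, 33, 24, 19, 25, 34,
    16, 43, 22, 20, 26, 44, 27, 19, 25, 34,
    33, 24, 22, 20, 44, 27, 31, 28, 38, 17,
    31, 26, 34, 17, 19, 20, 22, 24, 26, 29,
]


def frequency_distribution(scores, width, start):
    """Return (low, high, midpoint, f) rows, lowest interval first."""
    rows = []
    low = start
    while low <= max(scores):
        high = low + width - 1                   # score limits, e.g. 15-19
        f = sum(1 for s in scores if low <= s <= high)
        midpoint = ((low - 0.5) + (high + 0.5)) / 2  # from the actual limits
        rows.append((low, high, midpoint, f))
        low += width
    return rows


score_range = max(SCORES) - min(SCORES)          # step (1)
table = frequency_distribution(SCORES, width=5, start=15)

print("range =", score_range)
for low, high, midpoint, f in reversed(table):   # highest interval on top
    print(f"{low}-{high}  {midpoint:4.0f}  {f:3d}")
print("N =", sum(row[3] for row in table))
```

Changing width to 3 or 10 shows the effect described in step (3): the same scores spread over ten intervals or crowd into three.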


The Frequency Polygon 


Figure A-1 shows the frequency polygon of the forty scores
tabulated into a frequency distribution in Table A-1. Two axes,
a horizontal or X-axis and a vertical or Y-axis, are drawn at right
angles. Score intervals are laid off at regular distances along the
X-axis, or baseline, beginning with 15, the lower limit of the
first interval. The six scores on the lowest interval are represented
by a point six units up on the Y-axis and just above 17, the
midpoint of interval 15-19. The nine scores on the next interval
are represented by a point nine units up on Y and just above 22,
midpoint of the interval 20-24. The other f's are drawn in
in the same manner.

When all of the points are joined by short straight lines, we
have the outline of the frequency polygon. To complete the
figure (that is, to bring it down to the baseline at each end),
two intervals are added, one (10-14) at the low end and the other
(45-49) at the high end. The f on each of these intervals may
be taken as 0, and hence the frequency polygon reaches the
X-axis at 12 and 47.

FIGURE A-1  Frequency Distribution of the Forty Scores in Table A-1

In order to provide a symmetrical figure, one which is neither
too squat nor too thin, units in X and Y must be carefully
chosen. A good rule is to select units which will make the height
of the figure about 2/3 of its length. In Figure A-1 the maximum
f (10) is about 2/3 the baseline length of the polygon.
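
The points of the polygon can be listed with a short Python sketch. It is offered only as an illustration; the function name polygon_points is an assumption of the example, and the midpoints and frequencies are those of Table A-1.

```python
INTERVAL = 5
MIDPOINTS = [17, 22, 27, 32, 37, 42]   # midpoints from Table A-1
FREQS = [6, 9, 10, 8, 4, 3]            # f on each interval, lowest first


def polygon_points(midpoints, freqs, width):
    """Return the (X, Y) vertices, closed on the baseline at both ends."""
    left = (midpoints[0] - width, 0)    # midpoint of added interval 10-14
    right = (midpoints[-1] + width, 0)  # midpoint of added interval 45-49
    return [left, *zip(midpoints, freqs), right]


points = polygon_points(MIDPOINTS, FREQS, INTERVAL)
print(points)
```

Joining these points in order with straight lines gives the outline described above, reaching the X-axis at 12 and at 47.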


The Histogram 


The frequency distribution of Table A-1 is again represented
in Figure A-2, this time by a histogram, or column diagram.
The main difference between the frequency polygon and the
histogram is that in the histogram the f's are represented by
small rectangles whose height equals the f's on the intervals. In
Figure A-2, for example, the height of the first rectangle is 6,
its width being the length of the interval 14.5 to 19.5. Each
frequency rectangle begins at the actual lower limit of the
interval and ends at the actual upper limit. The histogram presents
the same facts as the frequency polygon and there is often little
to choose between them. When two or more frequency
distributions are represented on the same axes, however (as for example,
the scores of two classes or two sections of the same class), the
frequency polygon is to be preferred to the histogram, because
the vertical and horizontal lines in a histogram coincide and are
often difficult to disentangle.

FIGURE A-2  Histogram of the Forty Scores in Table A-1


COMPUTATION OF AVERAGES 


There are three averages in common use: the mean, the median
and the mode.


The Mean (M) 

We have defined the M on page 20 as the statistic found by
dividing the sum of the scores by their number. When scores
are put into a frequency distribution, the scores classified within
any interval lose their identity and are represented by the
midpoint of that interval. This necessitates a slightly different
procedure from that used with unorganized scores.

In Table A-2, section A, the midpoint of each interval is
multiplied or "weighted" by the frequency which lies opposite
it in the f column. This gives the fX column, and the sum of
this column (1100 in Table A-2) divided by N (40) gives a
mean of 27.50. The formula is

$$M = \frac{\Sigma fX}{N}$$

where ΣfX is the sum of the products f × X and N is the
number of cases.

TABLE A-2
Computation of the Mean from a Frequency Distribution
Data are the forty scores in Table A-1.

A. LONG METHOD

Intervals    Midpoints      f       fX
 40-44          42          3      126
 35-39          37          4      148
 30-34          32          8      256
 25-29          27         10      270
 20-24          22          9      198
 15-19          17          6      102
                       N = 40     1100

M = 1100/40 = 27.50

B. ASSUMED MEAN METHOD (SHORT METHOD)

Intervals    Midpoints      f       x'      fx'
 40-44          42          3       3        9
 35-39          37          4       2        8
 30-34          32          8       1        8
 25-29          27         10       0        0    (+25)
 20-24          22          9      -1       -9
 15-19          17          6      -2      -12    (-21)
                       N = 40                4

AM = 27.00     c = 4/40 = .10     ci = .10 × 5 = .50
M = 27.00 + .50 = 27.50

M can always be computed by the "Long Method" just
described, but it is generally computed by the Assumed Mean, or
"Short Method." When N is large, the Short Method reduces
calculation and saves time. Moreover, the Short Method is
mandatory when standard deviations and coefficients of correlation
are later to be computed from the same data. Computation of M
by the Assumed Mean or Short Method is shown in Table A-2,
Section B. Steps are as follows:

(1) Assume a mean, called the AM, near the center of the
frequency distribution and if possible on the interval having
the largest f. In our example, the AM is taken at 27, midpoint of
interval 25-29, and this interval also has the largest f.

(2) In the column x' * lay off deviations from the AM of 27
in units of interval. The midpoint of interval 30-34 (that is, 32)
deviates five scores or one interval from 27; and the midpoint
of interval 35-39 deviates two intervals from 27, and so on.
Below the AM, the deviations of the midpoints of the two
intervals (22 and 17) are -1 and -2. The midpoint of the interval
25-29 (that is, 27) is the assumed mean, and 0 is entered in
the x' column opposite this interval.

(3) Multiply each x' by its f and enter the product in the fx'
column. The sum of this column is 4 (that is, 25 - 21), from
which the correction (c) is calculated. The formula is

$$c = \frac{\Sigma fx'}{N}$$

and c = 4/40 or .10 in our problem.

(4) Multiply c, the correction in units of interval, by the
length of the interval or i, to give ci, the correction in score units.
In our example, ci = .10 × 5 = .50.

(5) Add the correction, ci, to the AM to get M. In Table A-2,
section B, the M = 27.00 + .50, or 27.50, thus checking the
computation in A above.

* x' denotes the deviation of a midpoint from the AM; that is,
x' = Midpt. - AM. Deviations from M are denoted by x.
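
Both methods can be checked against each other with a brief Python sketch. It is an illustration only, using the midpoints and frequencies of Table A-2; the function names mean_long and mean_short are invented for the example.

```python
MIDPOINTS = [42, 37, 32, 27, 22, 17]   # Table A-2, highest interval first
FREQS = [3, 4, 8, 10, 9, 6]
INTERVAL = 5


def mean_long(midpoints, freqs):
    """Long Method: M = sum(fX) / N."""
    n = sum(freqs)
    return sum(f * x for x, f in zip(midpoints, freqs)) / n


def mean_short(midpoints, freqs, assumed_mean, width):
    """Short Method: M = AM + ci, where c = sum(fx') / N."""
    n = sum(freqs)
    # x' is the deviation of each midpoint from the AM in units of interval
    x_primes = [(x - assumed_mean) // width for x in midpoints]
    c = sum(f * xp for xp, f in zip(x_primes, freqs)) / n
    return assumed_mean + c * width


m_long = mean_long(MIDPOINTS, FREQS)
m_short = mean_short(MIDPOINTS, FREQS, assumed_mean=27, width=INTERVAL)
print(m_long, m_short)
```

Taking a different midpoint as the AM changes c but not M, which is a useful check on the arithmetic.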


The Median, or Mdn 


The median is defined as that point in the distribution below
which and above which lie 50 per cent of the distribution. The
median is also described as the fiftieth percentile (P50) and the
second quartile (Q2). Computation of the median in a
frequency distribution is shown in Table A-3. (The Q, or quartile
deviation, which is found in the same way as the median, is
also computed in the table.)

TABLE A-3
Computation of the Median and Q from a Frequency Distribution
Data are the forty scores in Table A-1.

Intervals      f     Cum. f
 40-44         3       40
 35-39         4       37
 30-34         8       33
 25-29        10       25
 20-24         9       15
 15-19         6        6
          N = 40

N/2 = 20     N/4 = 10     3N/4 = 30

By formula, Median = 24.5 + 5 × (20 - 15)/10 = 27.0
By formula, Q3 = 29.5 + 5 × (30 - 25)/8 = 32.63
By formula, Q1 = 19.5 + 5 × (10 - 6)/9 = 21.72
Q = (32.63 - 21.72)/2 = 5.46

Steps are as follows:

(1) Take 1/2 of N and count into the distribution from the
low end until the interval containing the median is reached. In
Table A-3, N/2 = 20, and counting into the distribution from
interval 15-19, we locate the median on interval 25-29. The
two lowest intervals contain 6 + 9 or 15 f's, and it is clear from
this cumulated f that the twentieth score must fall on interval
25-29.

(2) Apply the following formula:

$$Mdn = l + i\left(\frac{N/2 - \text{cum } f_l}{f_m}\right)$$

in which
    l       = lower limit of interval on which Mdn lies
    N/2     = 1/2 of the number of scores
    cum f_l = sum of scores on intervals below l
    f_m     = frequency on the interval containing the Mdn
    i       = length of the interval

In our example in Table A-3, l = 24.5, lower limit of interval
containing Mdn; N/2 = 20; cum f_l = 15; f_m = 10; i = 5.
Substituting in the formula, we have

$$Mdn = 24.5 + 5\left(\frac{20 - 15}{10}\right) = 27.00$$
The median can be found by counting into the distribution from 
either end, but it is generally easier to start at the low end. 
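
The counting and interpolation just described can be sketched in Python as follows. The sketch is illustrative only; the name median_grouped_table and the layout of INTERVALS are assumptions of the example, with the frequencies taken from Table A-3.

```python
# (low score limit, high score limit, f), lowest interval first (Table A-3)
INTERVALS = [(15, 19, 6), (20, 24, 9), (25, 29, 10),
             (30, 34, 8), (35, 39, 4), (40, 44, 3)]


def median_grouped_table(intervals):
    """Mdn = l + i * (N/2 - cum f below l) / f_m, counting from the low end."""
    n = sum(f for _, _, f in intervals)
    half = n / 2
    cum_below = 0                         # cumulated f's below the interval
    for low, high, f in intervals:
        if f > 0 and cum_below + f >= half:
            l = low - 0.5                 # actual lower limit, e.g. 24.5
            i = high - low + 1            # length of the interval
            return l + i * (half - cum_below) / f
        cum_below += f
    raise ValueError("the table contains no scores")


mdn = median_grouped_table(INTERVALS)
print(mdn)
```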


The Mode 

The mode is usually taken as the midpoint of the interval
which contains the largest f. In Table A-3, the mode is simply 27,
the midpoint of the interval 25-29. This "midpoint" mode is
often called the crude mode. The mode may be calculated more
accurately; but since it is usually a preliminary statistic it is
hardly worth while to do so.


COMPUTATION OF MEASURES OF VARIABILITY 


The means or medians of two distributions are often the same
or nearly so, but the spread or scatter of the scores around the
central point is quite different. One class, for example, may
show the same mean but a much greater range of talent than
another. Knowing the variability of performance within a class
may be more useful than knowing its average or typical
performance.

There are three measures of variability all of which are used
in mental testing: the range, the Q and the SD (σ).


The Range 


The range is the gap between the smallest and largest scores. 
The range is a useful statistic, but is often a rough measure. It 
is least efficient when there are several outstanding scores— 
either very large or very small. For example, suppose there is a 
gap of 20 points between 75, the highest score, and 55, the score 
next below it. Then if the lowest score in the set is 25, the single 
outstandingly high score will increase the range from 30 to 50. 
We had occasion to find the range in constructing the frequency 
distribution (page 240). 


The Q, or Quartile Deviation 

Q, the quartile deviation, is defined as one-half the distance
between the seventy-fifth and twenty-fifth percentile points in
the distribution. To find these two percentiles, we must count
into the distribution as we do to find the median. In Table A-3,
for example, we count off 3/4 of N to get Q3 (the third quartile
or seventy-fifth percentile) and 1/4 of N to reach Q1 (the first
quartile or twenty-fifth percentile). The formula for Q3 is

$$Q_3 = l + i\left(\frac{3N/4 - \text{cum } f_l}{f_m}\right)$$

and the formula for Q1 is

$$Q_1 = l + i\left(\frac{N/4 - \text{cum } f_l}{f_m}\right)$$

in which
    l       = lower limit of the interval upon which the
              quartile point falls
    i       = the interval
    cum f_l = cumulated f's up to the interval containing
              the quartile wanted
    f_m     = f on the interval which contains the quartile

In Table A-3, 3/4 of N = 30. Counting into the distribution
from the low end, twenty-five scores take us to 29.5, lower
limit of 30-34, which is the interval containing Q3. The f on
this interval is 8. Substituting in the formula, we have

$$Q_3 = 29.5 + 5\left(\frac{30 - 25}{8}\right) = 32.63$$

To obtain Q1, we count off 1/4 of N or ten scores as shown
in Table A-3. Six scores take us to 19.5, lower limit of the
interval 20-24, the interval which contains Q1. The f on this
interval is 9. Substituting in the formula, we have

$$Q_1 = 19.5 + 5\left(\frac{10 - 6}{9}\right) = 21.72$$

From the two quartile points, Q3 and Q1, we find Q by
substituting in the formula

$$Q = \frac{Q_3 - Q_1}{2}$$

and in our example, Q = (32.63 - 21.72)/2 or 5.46.
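
Since Q1 and Q3 are found by the same counting rule as the median, one small function can serve for all three. The Python sketch below is an illustration only; the name point_below is invented for the example, and the frequencies are those of Table A-3.

```python
# (low score limit, high score limit, f), lowest interval first (Table A-3)
INTERVALS = [(15, 19, 6), (20, 24, 9), (25, 29, 10),
             (30, 34, 8), (35, 39, 4), (40, 44, 3)]


def point_below(fraction, intervals):
    """Score point below which the given fraction of the N cases lies."""
    n = sum(f for _, _, f in intervals)
    target = fraction * n                 # N/4 for Q1, 3N/4 for Q3
    cum_f = 0                             # cumulated f's below the interval
    for low, high, f in intervals:
        if f > 0 and cum_f + f >= target:
            lower_limit = low - 0.5       # actual lower limit
            width = high - low + 1        # length of the interval
            return lower_limit + width * (target - cum_f) / f
        cum_f += f
    raise ValueError("fraction must lie between 0 and 1")


q1 = point_below(0.25, INTERVALS)
q3 = point_below(0.75, INTERVALS)
q = (q3 - q1) / 2
print(f"Q1 = {q1:.3f}  Q3 = {q3:.3f}  Q = {q:.3f}")
```

Carried without rounding, Q comes to about 5.45. The 5.46 in the text results from rounding Q3 and Q1 to two decimals before subtracting.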


The Standard Deviation, SD or σ (sigma)

The standard deviation, or σ, is a measure of variability
computed around the mean; hence it is usually calculated from the
same frequency distribution as the mean. SD, or σ, is the most
stable measure of variability within a group and so is regularly
used in research problems which involve correlation and
inference. The computation of σ from a set of ungrouped scores
was outlined on page 22. Calculation of the SD from a
frequency distribution requires a somewhat different procedure.
The method is illustrated in Table A-4 for the same forty scores
tabulated in Table A-1.

TABLE A-4
Computation of the Standard Deviation (σ) from a
Frequency Distribution
Data are the forty scores in Table A-1.

Intervals      f      x'      fx'     fx'²
 40-44         3      3        9       27
 35-39         4      2        8       16
 30-34         8      1        8        8
 25-29        10      0        0        0
 20-24         9     -1       -9        9
 15-19         6     -2      -12       24
          N = 40               4       84

Sum of the positive fx' = +25; sum of the negative fx' = -21.

AM = 27.00     c = 4/40 = .10     c² = .01
σ = 5 × √(84/40 - .01) = 5 × 1.446 = 7.23

Steps are as follows:

(1) Find the deviation (x') of each midpoint from the AM,
as was done in Table A-2. Enter these figures as 3, 2, 1, 0,
-1, -2 (that is, in units of interval) in the x' column.

(2) Multiply each x' by its f to give the entries in the fx'
column.

(3) Multiply each x' and its corresponding fx' entry to give
the entries in the fx'² column. For example, x' = 3 times fx' = 9
gives 27 as the fx'² entry.

(4) Sum the fx'² column to give Σfx'².

(5) Compute the correction (c) as in Table A-2. Square c
to get c². Be sure that c is left in units of interval.

(6) Find σ from the following formula:

$$\sigma = i\sqrt{\frac{\Sigma fx'^2}{N} - c^2}$$

In our example, i = 5, Σfx'² = 84, N = 40 and c² = .01.
Substituting these values in the formula, we get σ = 7.23. It
will be clear that in computing σ we make use of the same
quantities used in finding the mean; only the fx'² is new.
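
The six steps reduce to a few lines of Python. The sketch is illustrative only; the function name sd_short is an assumption of the example, and the deviations and frequencies are those of Table A-4.

```python
import math

X_PRIMES = [3, 2, 1, 0, -1, -2]   # deviations from the AM in interval units
FREQS = [3, 4, 8, 10, 9, 6]       # f on each interval, highest first
INTERVAL = 5


def sd_short(x_primes, freqs, width):
    """sigma = i * sqrt(sum(fx'^2)/N - c^2), with c = sum(fx')/N."""
    n = sum(freqs)
    c = sum(f * xp for xp, f in zip(x_primes, freqs)) / n
    mean_square = sum(f * xp * xp for xp, f in zip(x_primes, freqs)) / n
    return width * math.sqrt(mean_square - c * c)


sigma = sd_short(X_PRIMES, FREQS, INTERVAL)
print(round(sigma, 2))
```

Note that c stays in units of interval inside the square root; multiplying by the interval length is done once, at the end.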


CORRELATION 


Correlation (page 27) is the correspondence or relationship
between two sets of test scores. Degree of relationship is
expressed by a coefficient of correlation (r) along a scale which
extends from -1.00 to +1.00 through .00. There are several
methods of computing correlation, of which the product-moment
method is the most often employed in dealing with test scores.
Calculation of a product-moment r is illustrated in Table A-5.

Table A-5 shows the computation of the correlation between
test scores in reading and arithmetic achieved by ten children
in the fifth grade. The sample is much too small to give an
adequate indication of the relationship between these two variables,
and our table must be taken as a much simplified illustration of
correlational method.

TABLE A-5
Correlation between Reading and Arithmetic in the Fifth Grade
(N = 10)

           Reading   Arithmetic
Pupils       (X)        (Y)         x      y      x²     y²     xy
John          60         26         7      5      49     25     35
Carol         55         24         2      3       4      9      6
Ann           63         18        10     -3     100      9    -30
Betty         40         21       -13      0     169      0      0
Louise        52         17        -1     -4       1     16      4
Tom           61         20         8     -1      64      1     -8
Bill          43         15       -10     -6     100     36     60
Joan          56         25         3      4       9     16     12
Dick          44         23        -9      2      81      4    -18
Carl          56         21         3      0       9      0      0
             530        210                       586    116     61

Mx = 53.0     My = 21.0     Σx² = 586     Σy² = 116     Σxy = 61
r = 61/√(586 × 116) = .23

The coefficient of correlation in Table A-5 is .23, revealing a
positive but quite low relationship between the two tests. The
first test (reading) is designated X, and the second test
(arithmetic) is Y. Note that, in order to compute the correlation, we
must first find the deviation of each child's X-score from Mx
and the deviation of his Y-score from My. Each deviation from
Mx (53) is entered in the x column, and each deviation from
My (21) is entered in the y column. Each x and y is then squared
and entered in the x² and y² columns, and the sums of these two
columns are found. In the last column (xy), the x and y
deviations of each pupil are multiplied with due regard for sign, and
the sum of the xy column is determined. Finally, the sum of the
xy column is divided by the square root of the product of the
Σx² and Σy² to give the coefficient of correlation. The formula is

$$r = \frac{\Sigma xy}{\sqrt{\Sigma x^2 \times \Sigma y^2}}$$

The formula for r may be written in a number of ways. The
form selected for use will depend on the character of the data,
size of the sample, purpose of the experimenter, and other
considerations. Whenever N is more than about 50, the correlation
coefficient should be computed from a diagram (see references:
Chapter 2).
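
The deviation formula for r can be sketched in Python as a check on Table A-5. The sketch is illustrative only, and the name product_moment_r is invented for the example. Joan's arithmetic score is entered as 25, the value implied by the column total of 210 and by her deviation of 4 from the mean of 21.

```python
import math

# Scores of the ten pupils in Table A-5, in the order listed there.
READING = [60, 55, 63, 40, 52, 61, 43, 56, 44, 56]      # X
ARITHMETIC = [26, 24, 18, 21, 17, 20, 15, 25, 23, 21]   # Y (Joan taken as 25)


def product_moment_r(xs, ys):
    """r = sum(xy) / sqrt(sum(x^2) * sum(y^2)), x and y being deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dev_x = [x - mean_x for x in xs]      # the x column
    dev_y = [y - mean_y for y in ys]      # the y column
    sum_xy = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
    sum_x2 = sum(dx * dx for dx in dev_x)
    sum_y2 = sum(dy * dy for dy in dev_y)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)


r = product_moment_r(READING, ARITHMETIC)
print(round(r, 2))
```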


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APPENDIX В 


PUBLISHERS OF MENTAL TESTS 


Teachers who do much testing should write the publishers
below for their catalogs.

Bureau of Publications, Teachers College, Columbia University,
New York 27, New York.

California Test Bureau, 5916 Hollywood Boulevard, Los Angeles
28, California.

Educational Test Bureau, 720 Washington Avenue, S.E.,
Minneapolis 14, Minnesota.

Educational Testing Service, Cooperative Test Division, 20
Nassau Street, Princeton, New Jersey.

Houghton Mifflin Company, 2 Park Street, Boston 7,
Massachusetts.

Psychological Corporation, 304 East 45th Street, New York 17,
New York.

Public School Publishing Company, 509-513 North East Street,
Bloomington, Illinois.

Science Research Associates, Inc., 57 West Grand Avenue,
Chicago 10, Illinois.

C. H. Stoelting and Company, 424 North Homan Avenue,
Chicago 24, Illinois.

Stanford University Press, Stanford, California.

World Book Company, 313 Park Hill Avenue, Yonkers 5, New
York.
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GLOSSARY. 


achievement test  A test designed to measure pupil performance
in some school subject.

age equivalent  The chronological age assigned to an obtained
score on a test representing the typical (average) age
corresponding to the score. Example: reading age = 8-4.

age norm  Typical performance on a test expressed in age
equivalents.

alternate forms of a test  Equivalent or parallel forms of a test.

aptitude test  A test designed to measure potential ability;
specifically, a test to predict future success in a school subject or in
a vocation.

attitude test  A test designed to measure likes or dislikes in a
given area. Example: attitude towards war.

battery  A group of tests, often combined into a team, designed
to measure a variety of abilities or aptitudes.

biserial r  A coefficient of correlation often used to measure the
discriminative power of an item in analysis.

central tendency  A measure typical of a group of scores; a mean,
median or mode.

chronological age (C.A.)  Life age expressed in years and months.
Thus, 10-4 means 10 years and 4 months.

completion items  Test questions in which the examinee must fill
in blank spaces in a statement or sentence in order to complete
the meaning.

correlation  The tendency for one test to be related (or
unrelated) to another test.

criterion  Any measure of performance with which a test is
compared in determining validity.

deviation IQ  A standard score found by converting raw scores
into a distribution with a mean = 100 and a σ of 15 or 16 points.

diagnostic tests  Tests designed to reveal pupils' strengths and
weaknesses in school subjects.

discriminating power  A test item which separates good from
poor students has discriminating power.

distractor  An option in a multiple-choice test that is incorrect.

essay items  Test items calling for a relatively free response.

evaluation  Appraisal of a pupil's performance; may include
in-school and out-of-school behaviors.

frequency distribution  An arrangement of test scores into groups
in order of size.

grade equivalent  The grade score assigned to a given obtained
score on a test. Example: A score of 42 on an achievement test
may have a grade equivalent of 6.5 (halfway through the sixth
grade).

graphic rating scale  A rating device in which possession of a
given degree of some trait is indicated by a check along a line.

group test  A test that may be administered to all members of a
group or class at the same time.

individual test  A test administered to only one person at a time.

IQ (intelligence quotient)  Originally, the ratio of mental age to
chronological age when mental age is obtained from an Age
Scale. Often used loosely to mean any set of scores with a mean
of 100. See deviation IQ.

intelligence tests  Tests designed to measure intelligence, which
may be defined as mental alertness or ability to do well in school.

inventory  A test or checklist of a person's personal
characteristics, attitudes, or interests.

item  A single question on a test.

item analysis  The process of determining the difficulty and
validity of test items through statistical analysis.

matching items  Test items in which the members of one list are
to be matched against the members of a second list.

mean  The arithmetic average of a set of test scores.

median  The point that divides a frequency distribution of
scores into two equal parts.

mental age (MA)  The age for which an obtained score on an
intelligence test is average or typical.

mode  The score which occurs most often in a distribution.

multiple-choice items  Test items which call for the selection of
a correct answer from among several options.

normal probability curve  A theoretical distribution curve which
many distributions of test scores approximate.

norms  Average performances for various groups, expressed as
age or grade equivalents for school children, as percentiles, and
in other ways.

objective test  A test answered by checking or circling a number
or letter. Example: True-false test items.

options  Responses from among which an examinee must make
a selection.

percentile rank (PR)  The equivalent to an obtained score on a
scale of 100 points. Example: If a score of 86 has a percentile
rank (PR) of 63, we know that 63 per cent of the group scored
below 86.

personality test  A test (often an inventory) designed to assess
an individual's personal and social behaviors.

power test  A test designed to measure level of performance
rather than speed.

profile  A graphic device for representing an examinee's scores
on several tests.

projective tests  Devices for studying personality through the use
of ink blots, pictures, designs.

quartile deviation (Q)  A measure of variability. Q equals
one-half of the range of the middle 50 per cent of scores.

questionnaire  A systematic inventory of questions covering
personality traits, attitudes, or interests.

readiness test  A measure of a child's readiness or maturity level.
Often used in reading.

reliability  Consistency of test scores.

reliability coefficient  Correlation coefficient giving the
self-correlation of a test.

skewness  The extent to which a distribution of scores is off
center or biased.

sociometry  Measurement of interpersonal relations within a class
or other group.

split-half reliability  Reliability coefficient found by splitting a
test into halves. The two parts of the test usually consist of odd-
and even-numbered items.

standard deviation (SD or σ)  A measure of variability.

standard score  A converted or derived score found by
expressing an obtained score as being so far above or below the mean
in SD units.

standardized tests  Printed tests for which there are norms on
defined groups. Directions are carefully prescribed.

test-retest reliability  The correlation between scores made on
the same test administered on two occasions.

T-score  A normalized score.

true-false items  Test items which the examinee is to mark as
true or false.

validity  The degree to which a test measures what it purports
to measure. There are several sorts of validity.

z-score  An obtained score expressed as a deviation from the
test mean in terms of σ. When z-scores are converted into a
frequency distribution with an assigned mean and σ, they are
called standard scores.
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Ability, meaning of, 46 


Achievement tests, 106; definition of, 102; diagnostic uses of,
?16; survey, 106; teacher-made, 210; value of, 115

Adjustment inventories, 164

Age norms, 41

Age scale, 32; value of, 33

American Council on Education Psychological Examination
(ACE), 91

Aptitudes, meaning of, 130

Aptitude tests, art, 151; batteries of, 133; case studies of the
use of, 226; clerical, 137; how to judge, 154; interpreting
scores in, 155; mechanical, 133; music, 151; use in
professional schools, 147

Army Alpha test, 8, 81

Army Beta test, 7

Army General Classification Test (AGCT), 8, 81

Art aptitude tests, 151

Arthur Point Scale, 73; use of, in schools, 75

Ascendance-Submission Reaction Study (Allport), 171

Attitudes, 171; questionnaires in the study of, 172 ff.


Bell Adjustment Inventory, 169

Bennett Mechanical Comprehension Test, 135

Binet, Alfred, 6; characteristics of his tests, 6-7

Biserial r, in item analysis, 215-216


California Achievement Tests, 111; characteristics of, 113

California Arithmetic Test, 212

California Test of Mental Maturity, 84; description of, 85-87

California Test of Personality, 166-167


SUBJECT INDEX 


Central tendency, meaning of, 19

Clerical aptitude tests, 137

Columbia Research Bureau Spanish Test, 211

Combining test scores, 34-37

Completion-test items, 202; illustration of, 203-204

Content analysis, 31, 126, 213

Cooperative French Test, 123

Cooperative General Achievement Tests, 113-114

Cooperative Mathematics Test, 122

Cooperative Science Test, 125

Correction for guessing, 191; when to use, 194

Correlation, meaning of, 27-28

Correlation coefficient, computation of, 251-252

Criteria, in validity, 154


Diagnostic tests, clinical, 57-59, 67-68; differential, 140-144;
educational, 52-56, 68, 70-72, 75-77, 93-95

Diagnostic Tests of Achievement in Music, 152-153

Differential Aptitude Tests (DAT), 140-144


Educational achievement tests, 102-103; and intelligence tests,
103; compared with school examinations, 104-106; in school
subjects, 118; how used in schools, 115-118; what to look
for in, 125-128

Educational age (EA), 127

Essay tests, described, 204-205; how to improve, 205-206;
scoring in, 206-207

Evaluation and Adjustment Series, 122


Frequency distribution, 14-15; rules for constructing, 239-241


Frequency polygon, 15; how to construct, 241-242


Galton, Francis, role in testing movement, 5-6

General Clerical Test, 139

Gordon's Personal Profile and Personal Inventory, 168-169

Grade norms, 41

Group tests, of intelligence, 80-82; in guidance, 93-95; norms
in, 98; reliability of, 97; scaling in, 97; use in schools, 92-96

Guidance, educational and vocational, 93-95, 115-117, 229-233


Halo effect in ratings, 162

Histogram, 16-17; how to construct, 242-243


Individual differences, importance of, »12

Intelligence, meaning of, 45; levels of, 46-47

Intelligence quotient (IQ), Stanford-Binet, 51-52; constancy of,
61-63; distribution of, 53-54; precautions in interpreting,
60-61; stability of, 56-57

Intelligence quotient (IQ), Wechsler-Bellevue, 65-67; in
diagnosis, 67-68; range of, 67

Intelligence tests, factors in the choice of, 96-100; group, 80-81;
individual, 44-45; performance, 72-75

Interest inventories, 174-180

Iowa Silent Reading Tests, 121-122

IQ (intelligence quotient), 33; as standard score, 39; as ratio, 52.
See Intelligence quotient

Item analysis, 214-221; short method of, 219-221

Item (test), difficulty of, 213; selection of, 212-213; validity
of, 214


Kuder Preference Record, 177-179

Kuhlmann-Anderson Intelligence Tests, 88-89


Law School Admission Test, 149


MacQuarrie Test of Mechanical Ability, 134-135

Matching items, 199; illustrations of, 200-202

Mean, 20; in frequency distribution, 243-246

Mechanical aptitude tests, 133-137

Median, 20; in frequency distribution, 246-247

Medical College Admission Test, 148-149

Meier Art Judgment Test, 153-154

Mental age (MA), 32-33

Mental tests, classification of, 3-5; compared with physical, 2-3;
history of, 5-12; uses of, in schools, 12

Metropolitan Achievement Tests, 109-111

Metropolitan Readiness Tests, 118-120

Minnesota Clerical Test, 137-139

Minnesota Paper Formboard Test, 144-145

Mode, 20-21, 247-248

Multiple-choice items, 193-194; illustrations of, 195-197

Multiple response items, 198-199

Murphy-Durrell Diagnostic Reading Readiness Test, 145-146

Musical aptitude tests, 151-153


National Teacher Examination, 150

Nelson-Denny Reading Test, 212

Normal distribution, 17; uses of, in testing, 17-19

Normal probability curve, 17; areas under, 23

Norms, 40; age, 33; percentile, 35; standard scores as, 35-38


Objectives, educational, 105-106

Objective tests, 80, 105; compared with essay examinations,
185-189; item types in, 185

Occupational Interest Inventory, 175-177

Orleans ne we Test, 146-147

Otis Quick-Scoring Mental Ability Tests, 87-88


Percentile rank, 25-27; advantages of, 33-36; limitations of, 36;
norms in terms of, 35

Percentile scale, 33-36

Performance tests, 72-75

Personality, meaning of, 157-158; inventories in the measurement
of, 164; rating scales in the measurement of, 158-163;
sociometric techniques in, 233-237

Personality inventories, 164-180; summary on the use of, 180-182

Pintner's Aspects of Personality, 167-168

Pintner-Cunningham Primary Tests, 82-84

Pre-Engineering Ability Tests, 149-150

Profiles, use of, in comparing test results, 35, 85, 143

Projective tests, meaning of, 158


Quartile, meaning of, 24-25

Quartile deviation (Q), calculation of, 248-249

Questionnaires, 164


r, coefficient of correlation, 27-28; calculation of, 251-252

Range of scores, 14, 248

Rating scales, 158-160; factors affecting, 160-162; improvement
in, 162-163; summary on, 163

Reliability of a test, coefficient of, 27-29; parallel forms in, 29;
split-half technique in, 227-224; test-retest in, 29


Seashore Measures of Musical Talent, 151-152

Selection of tests, factors in, 96-100, 125-128, 154-158

Sequential Tests of Educational Progress, 114-115

Sigma scores, meaning of, 36

Sociometric techniques, 233-237

Standard deviation, 22; calculation of, in simple series, 22; in a
frequency distribution, 249-25

Standard error, of a score, 29-30

Standard scores, 36; computation of, 36-38; normalized or
T-scores, 40

Standardized tests, 210-211

Stanford Achievement Tests, 106-109

Stanford-Binet Intelligence Scale, 47-52; reliability of, 56-57;
scoring in, 51; uses of, in the schools, 52-60; validity of, 61-63

Strong Vocational Interest Blank, 179-180

Study of Values (Allport-Vernon-Lindzey), 172-173


Teacher-made tests, 219 ff.

Terman-McNemar Test of Mental Ability, 89-91

Test items, varieties of, 185 ff.

Thurstone Temperament Schedule, 170-171

True-False items, 189-190; illustrations of, 192-193

T-score, 40

Turse Shorthand Aptitude Test, 147


Validity, of a test, 30-31; of test items, 214

Variability, in scores, 21

Verbal ability, and performance ability, 64-65, 68, 75-77


Wechsler-Bellevue Intelligence Scale, 63-67; in diagnosis, 68;
in the schools, 67-68

Wechsler Adult Intelligence Scale, 63

Wechsler Intelligence Scale for Children, 68-69; compared with
Stanford-Binet, 70; MA in, 72; range and stability of IQ's
in, 71


z-score, 36
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